This article describes how a train-inference mismatch was fixed when migrating PipelineRL from vLLM V0 to V1 for reinforcement learning.
- Four key fixes were needed: processed logprobs semantics, explicit runtime defaults, inflight weight update handling, and fp32 lm_head precision.
- The `processed_logprobs` setting fixed the semantic mismatch by returning logprobs from the sampler's processed distribution rather than the raw one (see the first sketch below).
- Runtime configs required explicit settings (`enable-prefix-caching: false`, `async-scheduling: false`) to match V0 behavior in online RL (also covered in the first sketch below).
- fp32 lm_head computation was critical because small changes in logits become visible in policy ratios and KL divergence (see the second sketch below).
This summary was automatically generated by AI based on the original article and may not be fully accurate.
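To make the first two fixes concrete, here is a minimal sketch of how the equivalent options might be set when constructing the engine in Python. It assumes a recent vLLM V1 where `logprobs_mode`, `enable_prefix_caching`, and `async_scheduling` are exposed as engine arguments; the model name is a placeholder and this is not PipelineRL's actual config plumbing.

```python
# Minimal sketch, assuming a recent vLLM V1; not PipelineRL's real setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",    # placeholder model, not from the article
    logprobs_mode="processed_logprobs",  # return logprobs from the sampler's processed distribution
    enable_prefix_caching=False,         # cached prefixes could reflect stale weights during online RL
    async_scheduling=False,              # keep V0-style synchronous scheduling
)

params = SamplingParams(temperature=1.0, max_tokens=64, logprobs=0)  # logprob of the sampled token only
outputs = llm.generate(["The quick brown fox"], params)
# Each returned logprob now comes from the same post-temperature/penalty
# distribution the token was sampled from, which is what the trainer must
# reproduce to avoid a train-inference mismatch.
```

Disabling prefix caching matters in online RL because cached KV entries may have been computed under weights that an inflight update has since replaced.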
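For the fp32 lm_head fix, a sketch of the trainer-side idea under common Hugging Face naming: `model.lm_head` and `hidden_states` are assumptions, and the article's corresponding patch on the vLLM side is not reproduced here. The point is that the final projection and softmax are computed in fp32 so trainer and sampler logprobs agree closely.

```python
# Hedged sketch of fp32 logit computation on the trainer side; names follow
# common HF conventions and are not taken from the article.
import torch

def fp32_logprobs(model, hidden_states, token_ids):
    """Per-token logprobs with the lm_head projection done in fp32.

    Under bf16, tiny rounding differences in logits between trainer and
    sampler inflate policy ratios and KL estimates; upcasting the final
    projection keeps both sides numerically aligned. Ignores any lm_head
    bias, which most LMs do not have.
    """
    lm_head_weight = model.lm_head.weight.float()      # upcast projection weights
    logits = hidden_states.float() @ lm_head_weight.T  # fp32 matmul: [B, T, V]
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
```

Because PPO-style objectives exponentiate the logprob gap via exp(log pi_new - log pi_old), even sub-ulp bf16 discrepancies in logits surface directly in the policy ratio and KL terms.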