
arxiv: 2605.31455 · v1 · pith:4GCOYZFZnew · submitted 2026-05-29 · 💻 cs.LG · cs.CL

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords multi-turn optimization · importance-weighted fine-tuning · offline reinforcement learning · language model fine-tuning · KL-regularized RL · decoupled rollouts · behavioral collapse

The pith

DRIFT achieves multi-turn RL performance by sampling fixed-policy trajectories and applying return-based importance weights in supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models need optimization for multi-turn interactions where feedback arrives iteratively, but online reinforcement learning incurs high costs from generating correction trajectories at every step while standard supervised fine-tuning risks distribution shift and collapse. The paper establishes that the KL-regularized RL objective equals importance-weighted supervised learning, so DRIFT decouples the two by drawing trajectories once from a fixed reference policy, computing return-based weights, and running weighted SFT on that static dataset. A sympathetic reader cares because this promises to retain RL's handling of sequential dynamics while preserving the low cost and simplicity of ordinary fine-tuning. If correct, the method removes the need to regenerate full trajectories during policy updates.

Core claim

DRIFT operationalizes the equivalence between the KL-regularized RL objective and importance-weighted supervised learning by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically this matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning.
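For readers who want the equivalence spelled out, the standard argument (along the lines of the Levine and Peng et al. entries in the reference list) can be sketched as follows. The notation is an assumption made here for illustration: a whole interaction is treated as a single sample τ with return R(τ), β is the KL coefficient, and π_ref is the fixed reference policy. The paper's own statement may differ in form.

```latex
% KL-regularized objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big]
  - \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
\quad\Longrightarrow\quad
\pi^{*}(\tau) = \frac{1}{Z}\, \pi_{\mathrm{ref}}(\tau)\, \exp\!\big(R(\tau)/\beta\big),
\qquad
Z = \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\big[\exp\!\big(R(\tau)/\beta\big)\big].

% Projecting \pi^{*} onto the parametric family is a weighted log-likelihood problem
\arg\min_{\theta}\, \mathrm{KL}\big(\pi^{*} \,\|\, \pi_{\theta}\big)
= \arg\max_{\theta}\, \mathbb{E}_{\tau \sim \pi^{*}}\big[\log \pi_{\theta}(\tau)\big]
= \arg\max_{\theta}\, \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}}\!\left[
    \frac{\exp\!\big(R(\tau)/\beta\big)}{Z}\, \log \pi_{\theta}(\tau)
  \right].
```

The last expectation is taken under π_ref only, which is why samples drawn once from the reference policy can in principle be reused for every gradient step.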

What carries the argument

Decoupled rollouts from a fixed reference policy with return-based importance weights fed into weighted supervised fine-tuning, which approximates the multi-turn value function offline.
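A minimal sketch of what "return-based weights fed into weighted SFT" could look like in practice. Everything here is an assumption for illustration rather than the authors' implementation: the helper names are hypothetical, γ is assumed to act as a per-turn discount, and weights are assumed to be self-normalized as exp(R/β) over the rollouts of one prompt.

```python
import numpy as np

def discounted_return(turn_rewards, gamma):
    """Sum of gamma**t * r_t over the turns of one trajectory (assumed form)."""
    return sum((gamma ** t) * r for t, r in enumerate(turn_rewards))

def return_based_weights(returns, beta):
    """Self-normalized weights w_i proportional to exp(R_i / beta)."""
    scaled = np.asarray(returns, dtype=float) / beta
    scaled -= scaled.max()  # subtract the max for numerical stability
    w = np.exp(scaled)
    return w / w.sum()

def weighted_sft_loss(seq_logprobs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log pi_theta(y_i | x_i)."""
    return -float(np.dot(weights, seq_logprobs))

# Toy data: K = 3 rollouts for one prompt; reward 1 only on the turn that succeeds.
rollout_rewards = [[1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
returns = [discounted_return(r, gamma=0.9) for r in rollout_rewards]
weights = return_based_weights(returns, beta=0.5)

# Hypothetical sequence log-probabilities of each rollout under the current policy.
loss = weighted_sft_loss(np.array([-2.0, -3.0, -4.0]), weights)
```

With these toy numbers the rollout that succeeds on the first turn receives the largest weight, the one that succeeds on the third turn a smaller one, and the failed rollout the smallest. Lowering β sharpens that ordering.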

If this is right

  • DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines.
  • DRIFT maintains the training efficiency and simplicity of standard supervised fine-tuning.
  • DRIFT mitigates distribution shift and behavioral collapse that arise in plain offline SFT for multi-turn settings.
  • The method supports optimization from lightweight iterative feedback without requiring repeated full-trajectory regeneration during updates.
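The last point rests on collecting interactions once, before any training. A toy sketch of such a one-off collection loop follows, under loud assumptions: `policy` and `is_correct` are hypothetical callables standing in for the reference model and an answer checker, the feedback string is invented, and the reward scheme (1 on the turn that succeeds) is chosen only for illustration.

```python
import random

def collect_trajectory(policy, is_correct, prompt, max_turns,
                       feedback="Incorrect, please try again."):
    """Roll out one multi-turn episode from a fixed policy; no gradient step happens here."""
    history = [prompt]
    rewards = []
    for turn in range(max_turns):
        response = policy(history)
        history.append(response)
        solved = is_correct(response)
        rewards.append(1.0 if solved else 0.0)
        if solved:
            break
        if turn < max_turns - 1:
            history.append(feedback)  # lightweight feedback, then another attempt
    return {"history": history, "rewards": rewards}

def build_offline_dataset(policy, is_correct, prompts, k, max_turns):
    """Sample k trajectories per prompt once; the result is a static dataset."""
    return [collect_trajectory(policy, is_correct, p, max_turns)
            for p in prompts for _ in range(k)]

# Toy stand-ins for a reference model and a verifier (both hypothetical).
rng = random.Random(0)

def toy_policy(history):
    return str(rng.randint(0, 3))

def toy_checker(response):
    return response == "2"

dataset = build_offline_dataset(toy_policy, toy_checker,
                                ["Guess the number."], k=4, max_turns=5)
```

Because the policy passed in never changes, the resulting list can be stored and reused across all optimization steps, which is the decoupling the bullets describe.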

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same offline weighting pattern could lower costs in other sequential decision domains where online sampling is expensive.
  • Fixed-reference sampling may still require periodic updates if the environment distribution drifts significantly over time.
  • Combining DRIFT weights with limited online corrections could serve as a hybrid that tests the necessity of full decoupling.

Load-bearing premise

Trajectories sampled from a fixed reference policy plus return-based importance weights are sufficient to approximate the multi-turn value function and avoid behavioral collapse without any online correction.

What would settle it

If DRIFT underperforms online multi-turn RL baselines or exhibits behavioral collapse when tested on tasks with long interaction horizons and strong environmental feedback, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.31455 by Chengwei Qin, Jian Mu, Tianyi Lin, Yao Shu, Zhongxiang Dai.

Figure 1
Figure 1: Multi-turn interaction. The user engages in a dialogue with the LLM. If the LLM provides an incorrect response, the user offers simple feedback to point out the error. The LLM then re-attempts the task until a correct answer is generated or the maximum number of turns is reached.
Figure 2
Figure 2: DRIFT overall framework overview. DRIFT consists of two stages: (1) an offline rollout stage, where a batch of trajectories is sampled once from the reference policy and trajectory weights are computed based on the return; and (2) a weighted supervised optimization stage, where the collected (x, y, w) tuples are used for weighted SFT. This fully decouples rollout from training, enabling DRIFT to achieve RL…
Figure 3
Figure 3: Empirical support for terminal-step retention. Both variants use the same offline trajectories and trajectory weights; the all-turn variant supervises every response, while terminal-only supervision uses only the final response conditioned on the full interaction history.
Figure 4
Figure 4: For different γ values, report the proportion of problems cumulatively solved at each turn relative to the total number of problems solved by the end. For different β and feedback values, report the cumulative accuracy at each turn.
Figure 5
Figure 5: Cumulative success rate and correction rate per turn on MATH500 for Qwen2.5-3B-Instruct trained with different methods.
Figure 6
Figure 6: Training efficiency comparison in GPU time across two base models and two hardware configurations. Compare with multi-turn SFT (SFT-5Turn) and multi-turn RL (UFO-5Turn).
Figure 7
Figure 7: Accuracy under different γ values over training steps.
Figure 8
Figure 8: Accuracy under different β values over training steps.
Figure 9
Figure 9: Accuracy under different K values over training steps.
Figure 10
Figure 10: Trajectory comparison. The Base Model (Qwen2.5-3B-Instruct) correctly sets up the inequality 3c ≤ 2 but falls into a reasoning loop, repeatedly miscalculating the integer solutions despite feedback. DRIFT initially errs but successfully uses feedback in Turn 3 to correct its analysis of the prime factor 5 (5^3 ∤ 10!), deriving the correct answer.
Figure 11
Figure 11: Trajectory comparison on a modular arithmetic problem. The Base Model commits a calculation error in Turn 2, falsely believing that 400 is divisible by 23. It becomes confident in this incorrect verification and ignores subsequent negative feedback, repeating the answer 19. DRIFT treats the feedback as a signal to explore the solution space, moving from candidate 5 to 15, and finally verifying 17 correctl…
Figure 12
Figure 12: Risk Analysis on GPQA. UFO exposes the risk of blind guessing by cycling through options. Critically, the Base model and DRIFT also resort to guessing, relying on hallucinated mechanisms (e.g., "carbocation") despite selecting the correct option. This collective failure in reasoning reveals a fundamental capability deficit: without domain knowledge, models revert to various forms of guessing rather than g…

original abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we novelly propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DRIFT, a framework that decouples rollout generation from optimization for multi-turn LLM fine-tuning. It samples full trajectories offline from a fixed reference policy, derives return-based importance weights, and performs weighted supervised fine-tuning on the resulting static dataset. The central claim is that this approach matches or exceeds the performance of online multi-turn RL baselines while retaining the training efficiency and simplicity of standard SFT.

Significance. If the empirical results and underlying assumptions hold, DRIFT would offer a practical bridge between the effectiveness of KL-regularized multi-turn RL and the computational simplicity of offline SFT, potentially enabling broader adoption of interactive optimization techniques without repeated online trajectory generation.

major comments (2)
  1. [Abstract] The claim of empirical parity with multi-turn RL baselines is asserted without any reported experiment details, variance estimates, effective sample size, or analysis of importance-weight variance, leaving the central empirical result unverifiable from the provided text.
  2. [Theoretical Framework / Experiments] The method relies on the standard equivalence between KL-regularized RL and importance-weighted learning, but provides no analysis or metrics on reference-policy coverage of multi-turn state-action distributions, weight variance, or effective sample size. These quantities are load-bearing for the claim that fixed-reference trajectories plus return-based weights suffice to approximate the value function without online correction or behavioral collapse.
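The diagnostics the referee asks for are cheap to compute from the returns alone. A small sketch, assuming weights of the form exp(R/β) (an assumption about the weighting scheme, not a detail confirmed by the text above) and using the Kish effective sample size; the function names are hypothetical.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

def weight_diagnostics(returns, beta):
    """Summary statistics for weights proportional to exp(R / beta)."""
    r = np.asarray(returns, dtype=float)
    w = np.exp((r - r.max()) / beta)  # shifted by the max for numerical stability
    n = w.size
    ess = effective_sample_size(w)
    return {
        "ess": ess,
        "ess_fraction": ess / n,
        "max_normalized_weight": float(w.max() / w.sum()),
        # variance of weights rescaled to have mean 1
        "normalized_weight_variance": float(np.var(n * w / w.sum())),
    }

# Identical returns give uniform weights; one success with tiny beta gives a single dominant weight.
uniform = weight_diagnostics([1.0, 1.0, 1.0, 1.0], beta=1.0)
peaked = weight_diagnostics([1.0, 0.0, 0.0, 0.0], beta=0.01)
```

An effective sample size close to the number of trajectories indicates that the weighted objective uses most of the offline data, while a value near 1 means a handful of trajectories dominate the update, which is the regime where the fixed-reference premise is most exposed.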

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of empirical results and supporting analyses. We will revise the manuscript accordingly to improve verifiability while preserving the core contributions.

point-by-point responses
  1. Referee: [Abstract] The claim of empirical parity with multi-turn RL baselines is asserted without any reported experiment details, variance estimates, effective sample size, or analysis of importance-weight variance, leaving the central empirical result unverifiable from the provided text.

    Authors: The abstract is space-constrained and therefore omits granular statistics. The full manuscript (Section 4 and Appendix) reports results over multiple random seeds with means and standard deviations, along with the experimental protocol. We will revise the abstract to briefly reference these details and will add a dedicated paragraph plus table summarizing importance-weight variance and effective sample size in the experiments section. revision: yes

  2. Referee: [Theoretical Framework / Experiments] The method relies on the standard equivalence between KL-regularized RL and importance-weighted learning, but provides no analysis or metrics on reference-policy coverage of multi-turn state-action distributions, weight variance, or effective sample size. These quantities are load-bearing for the claim that fixed-reference trajectories plus return-based weights suffice to approximate the value function without online correction or behavioral collapse.

    Authors: We agree that explicit metrics on coverage, weight variance, and effective sample size would strengthen the empirical grounding. The equivalence itself is standard (as cited), and the reference policy is the initial SFT model, which by construction covers the training distribution. In revision we will add quantitative analysis of these quantities, including effective sample size calculations and weight histograms, to the experiments section and a new appendix. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation rests on standard external equivalence between KL-regularized RL and importance-weighted SFT

full rationale

The paper invokes the known equivalence of the KL-regularized RL objective to importance-weighted supervised learning as a theoretical foundation, then decouples rollout (fixed reference policy) from optimization (weighted SFT). This equivalence is not derived or reduced within the paper's own equations; it is treated as an established insight that the method operationalizes. No parameters are fitted to a subset and then renamed as predictions, no self-citation chains bear the central claim, and the empirical performance comparison is presented as validation rather than a definitional consequence. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Ledger extracted from abstract only; no free parameters or invented entities named.

axioms (1)
  • domain assumption: the KL-regularized RL objective is equivalent to importance-weighted supervised learning
    Invoked as the theoretical foundation for the method.

pith-pipeline@v0.9.1-grok · 5734 in / 991 out tokens · 16196 ms · 2026-06-28T23:14:48.707999+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 24 canonical work pages · 16 internal anchors

  1. [1]

    Online preference alignment for language models via count-based exploration

    Bai, C., Zhang, Y., Qiu, S., Zhang, Q., Xu, K., and Li, X. Online preference alignment for language models via count-based exploration. arXiv preprint arXiv:2501.12735.

  2. [2]

    TheoremQA: A theorem-driven question answering dataset

    Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. TheoremQA: A theorem-driven question answering dataset. arXiv preprint arXiv:2305.12524.

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

  4. [4]

    Regressing the relative future: Efficient policy optimization for multi-turn RLHF

    Gao, Z., Zhan, W., Chang, J. D., Swamy, G., Brantley, K., Lee, J. D., and Sun, W. Regressing the relative future: Efficient policy optimization for multi-turn RLHF. arXiv preprint arXiv:2410.04612.

  5. [5]

    Are we done with MMLU?

    Gema, A. P., Leang, J. O. J., Hong, G., Devoto, A., Mancino, A. C. M., Saxena, R., He, X., Zhao, Y., Du, X., Madani, M. R. G., et al. Are we done with MMLU? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5069–5096.

  6. [6]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  7. [7]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

  8. [8]

    Training Language Models to Self-Correct via Reinforcement Learning

    Kumar, A., Zhuang, V., Agarwal, R., Su, Y., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917.

  9. [9]

    LLMs Get Lost In Multi-Turn Conversation

    Laban, P., Hayashi, H., Zhou, Y., and Neville, J. LLMs get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120.

  10. [10]

    Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

    Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.

  11. [11]

    Confidence matters: Revisiting intrinsic self-correction capabilities of large language models

    Li, L., Chen, Z., Chen, G., Zhang, Y., Su, Y., Xing, E., and Zhang, K. Confidence matters: Revisiting intrinsic self-correction capabilities of large language models. arXiv preprint arXiv:2402.12563.

  12. [12]

    Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

    Li, Y., Shen, X., Yao, X., Ding, X., Miao, Y., Krishnan, R., and Padman, R. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717.

  13. [13]

    A simple "try again" can elicit multi-turn LLM reasoning

    Liu, L., Wang, Z., Li, L., Xu, C., Lu, Y., Liu, H., Sil, A., and Li, M. A simple "try again" can elicit multi-turn LLM reasoning. arXiv preprint arXiv:2507.14295.

  14. [14]

    Self-Refine: Iterative Refinement with Self-Feedback

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651.

  15. [15]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.

  16. [16]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.

  17. [17]

    Supervised fine tuning on curated data is reinforcement learning (and can be improved)

    Qin, C. and Springenberg, J. T. Supervised fine tuning on curated data is reinforcement learning (and can be improved). arXiv preprint arXiv:2507.12856.

  18. [18]

    Qwen2.5 Technical Report

    URL https://arxiv.org/abs/2412.15115. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

  19. [19]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

  20. [20]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  22. [22]

    Beyond human data: Scaling self-training for problem-solving with language models

    Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu, P. J., Harrison, J., Lee, J., Xu, K., et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

  23. [23]

    Generating sequences by learning to self-correct

    Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and Choi, Y. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053.

  24. [24]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245.

  25. [25]

    Llm alignment through successive policy re-weighting (spr)

    Zhang, X., Zeng, S., Li, J., Lin, K., and Hong, M. LLM alignment through successive policy re-weighting (SPR). In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability.

  26. [26]

    Group Sequence Policy Optimization

    Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., and Lin, J. Group sequence policy optimization, 2025a. URL https://arxiv.org/abs/2507.18071. Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arX...

  27. [27]

    Complementary analyses study when RLVR with answer-only rewards can still promote correct reasoning and what normalization or supervision improves stability (Wen et al., 2025)

    and importance-weighted SFT (Qin & Springenberg, 2025), and successive policy reweighting schemes that target RL objectives with SFT-like compute (Zhang et al., 2024). Complementary analyses study when RLVR with answer-only rewards can still promote correct reasoning and what normalization or supervision improves stability (Wen et al., 2025). Although onl...

  28. [28]

    as the primary math benchmark with competition-style problems that require multi-step derivations. We also report MATH500 (Hendrycks et al., 2021), a 500-problem evaluation. Table 3. Additional model results on Qwen2.5-7B-Instruct after training on MetaMat...