DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Chengwei Qin; Jian Mu; Tianyi Lin; Yao Shu; Zhongxiang Dai

arxiv: 2605.31455 · v1 · pith:4GCOYZFZnew · submitted 2026-05-29 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-28 23:14 UTC · model grok-4.3

keywords multi-turn optimization · importance-weighted fine-tuning · offline reinforcement learning · language model fine-tuning · KL-regularized RL · decoupled rollouts · behavioral collapse

The pith

DRIFT aims to match multi-turn RL performance at SFT cost by sampling trajectories once from a fixed reference policy and applying return-based importance weights during supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models need optimization for multi-turn interactions where feedback arrives iteratively, but online reinforcement learning incurs high costs from generating correction trajectories at every update, while standard supervised fine-tuning risks distribution shift and behavioral collapse. The paper builds on the insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT therefore decouples rollout from optimization: it draws trajectories once from a fixed reference policy, computes return-based weights, and runs weighted SFT on that static dataset. A sympathetic reader cares because this promises to retain RL's handling of sequential dynamics while preserving the low cost and simplicity of ordinary fine-tuning. If correct, the method removes the need to regenerate full trajectories during policy updates.

Core claim

DRIFT operationalizes the equivalence between the KL-regularized RL objective and importance-weighted supervised learning by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically this matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning.
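
The equivalence named in the core claim can be sketched in its standard textbook form. The notation below (prompt $x$, trajectory $\tau$, return $R$, temperature $\beta$, partition function $Z$) is an assumption made for illustration and may differ from the paper's own statement; the sketch also assumes that environment feedback does not depend on the policy parameters, so that only response tokens contribute to $\log \pi_\theta$.

```latex
% KL-regularized objective for a policy \pi against a fixed reference \pi_{\mathrm{ref}}
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi(\cdot \mid x)}\big[ R(x, \tau) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its closed-form maximizer is the exponentially tilted reference policy
\pi^{*}(\tau \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(\tau \mid x)\, \exp\!\big( R(x, \tau) / \beta \big),
\qquad
Z(x) \;=\; \mathbb{E}_{\tau \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[ \exp\!\big( R(x, \tau) / \beta \big) \big]

% Fitting \pi_\theta to \pi^{*} by maximum likelihood becomes a weighted SFT objective
% over samples drawn from \pi_{\mathrm{ref}}
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \left[ \frac{\exp\!\big( R(x, \tau) / \beta \big)}{Z(x)}\, \log \pi_{\theta}(\tau \mid x) \right]
```

Read this way, a small $\beta$ concentrates the weights on the highest-return trajectories, while a large $\beta$ flattens them toward plain SFT on reference samples.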

What carries the argument

Decoupled rollouts from a fixed reference policy with return-based importance weights fed into weighted supervised fine-tuning, which approximates the multi-turn value function offline.
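
A minimal sketch of the two pieces this mechanism needs, assuming exponential return weights normalized per prompt and sequence-level log-probabilities. The function names `return_weights` and `weighted_sft_loss`, the binary returns, and the value of `beta` are hypothetical illustrations, not taken from the paper or its repository.

```python
import math

def return_weights(returns, beta):
    """Self-normalized weights w_i proportional to exp(R_i / beta) for one prompt's rollouts.

    Assumption: exponential tilting with temperature beta; the paper's exact weight may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    scaled = [r / beta for r in returns]
    shift = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sft_loss(seq_logprobs, weights):
    """Weighted negative log-likelihood: -sum_i w_i * log pi_theta(y_i | x_i)."""
    if len(seq_logprobs) != len(weights):
        raise ValueError("one weight per sequence is required")
    return -sum(w * lp for w, lp in zip(weights, seq_logprobs))

# Four rollouts for one prompt, sampled once from the reference policy (toy numbers).
returns = [1.0, 0.0, 0.0, 1.0]        # hypothetical returns: 1 = solved, 0 = not solved
logprobs = [-2.0, -1.0, -3.0, -4.0]   # hypothetical log pi_theta of each response
weights = return_weights(returns, beta=0.5)
loss = weighted_sft_loss(logprobs, weights)
```

Because the weights are computed once from stored returns, the loss can be re-evaluated for any number of epochs without generating new rollouts, which is the decoupling the section describes.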

If this is right

DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines.
DRIFT maintains the training efficiency and simplicity of standard supervised fine-tuning.
DRIFT mitigates distribution shift and behavioral collapse that arise in plain offline SFT for multi-turn settings.
The method supports optimization from lightweight iterative feedback without requiring repeated full-trajectory regeneration during updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same offline weighting pattern could lower costs in other sequential decision domains where online sampling is expensive.
Fixed-reference sampling may still require periodic updates if the environment distribution drifts significantly over time.
Combining DRIFT weights with limited online corrections could serve as a hybrid that tests the necessity of full decoupling.

Load-bearing premise

Trajectories sampled from a fixed reference policy plus return-based importance weights are sufficient to approximate the multi-turn value function and avoid behavioral collapse without any online correction.

What would settle it

If DRIFT underperforms online multi-turn RL baselines or exhibits behavioral collapse when tested on tasks with long interaction horizons and strong environmental feedback, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.31455 by Chengwei Qin, Jian Mu, Tianyi Lin, Yao Shu, Zhongxiang Dai.

**Figure 1.** Multi-turn interaction. The user engages in a dialogue with the LLM. If the LLM provides an incorrect response, the user offers simple feedback to point out the error. The LLM then re-attempts the task until a correct answer is generated or the maximum number of turns is reached.

**Figure 2.** DRIFT overall framework overview. DRIFT consists of two stages: (1) an offline rollout stage, where a batch of trajectories is sampled once from the reference policy and trajectory weights are computed based on the return; and (2) a weighted supervised optimization stage, where the collected (x, y, w) tuples are used for weighted SFT. This fully decouples rollout from training, enabling DRIFT to achieve RL…

**Figure 3.** Empirical support for terminal-step retention. Both variants use the same offline trajectories and trajectory weights; the all-turn variant supervises every response, while terminal-only supervision uses only the final response conditioned on the full interaction history.

**Figure 4.** For different γ values, report the proportion of problems cumulatively solved at each turn relative to the total number of problems solved by the end. For different β and feedback values, report the cumulative accuracy at each turn.

**Figure 5.** Cumulative success rate and correction rate per turn on MATH500 for Qwen2.5-3B-Instruct trained with different methods.

**Figure 6.** Training efficiency comparison in GPU time across two base models and two hardware configurations. Compare with multi-turn SFT (SFT-5Turn) and multi-turn RL (UFO-5Turn).

**Figure 7.** Accuracy under different γ values over training steps.

**Figure 8.** Accuracy under different β values over training steps.

**Figure 9.** Accuracy under different K values over training steps.

**Figure 10.** Trajectory comparison. The Base Model (Qwen2.5-3B-Instruct) correctly sets up the inequality 3c ≤ 2 but falls into a reasoning loop, repeatedly miscalculating the integer solutions despite feedback. DRIFT initially errs but successfully uses feedback in Turn 3 to correct its analysis of the prime factor 5 (5^3 ∤ 10!), deriving the correct answer.

**Figure 11.** Trajectory comparison on a modular arithmetic problem. The Base Model commits a calculation error in Turn 2, falsely believing that 400 is divisible by 23. It becomes confident in this incorrect verification and ignores subsequent negative feedback, repeating the answer 19. DRIFT treats the feedback as a signal to explore the solution space, moving from candidate 5 to 15, and finally verifying 17 correctl…

**Figure 12.** Risk Analysis on GPQA. UFO exposes the risk of blind guessing by cycling through options. Critically, the Base model and DRIFT also resort to guessing, relying on hallucinated mechanisms (e.g., "carbocation") despite selecting the correct option. This collective failure in reasoning reveals a fundamental capability deficit: without domain knowledge, models revert to various forms of guessing rather than g…

read the original abstract

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning is able to effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we novelly propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DRIFT makes multi-turn optimization cheaper through offline weighted SFT, but its coverage and variance issues need verification before the claims can be trusted.

DRIFT decouples the generation of multi-turn trajectories from the optimization step by sampling them once from a fixed reference policy and then applying return-based importance weights in a supervised fine-tuning pass. The paper claims this achieves performance on par with online RL methods at the cost of standard SFT.

The new element is operationalizing the KL-RL equivalence for multi-turn settings with this specific decoupling. Prior work has the theoretical link, but applying it to avoid online rollouts in interactive LLM training is the practical step forward here.

The paper does a solid job framing the tradeoff between expensive online correction and the collapse issues in plain SFT. The method stays close to familiar SFT pipelines, which lowers the barrier for adoption.

Where it is softer is on the practical validity of the fixed-reference approach. Multi-turn dynamics make distribution shift compound quickly, and a single reference policy may not hit the states that an improving policy would reach. The abstract gives no numbers on importance weight variance or effective sample size, so it is not possible to tell if the weights are usable or if the method just reproduces the reference behavior. The stress-test concern about coverage holds up based on the description.

This work is aimed at teams optimizing LLMs for chat or agent tasks who need something between full RL and basic SFT. A reader experimenting with efficient fine-tuning techniques could find the framework worth implementing and testing on their own setups.

I would send this to peer review. The problem is relevant and the proposal is concrete enough that referees can check whether the experiments address the coverage and variance issues.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DRIFT, a framework that decouples rollout generation from optimization for multi-turn LLM fine-tuning. It samples full trajectories offline from a fixed reference policy, derives return-based importance weights, and performs weighted supervised fine-tuning on the resulting static dataset. The central claim is that this approach matches or exceeds the performance of online multi-turn RL baselines while retaining the training efficiency and simplicity of standard SFT.

Significance. If the empirical results and underlying assumptions hold, DRIFT would offer a practical bridge between the effectiveness of KL-regularized multi-turn RL and the computational simplicity of offline SFT, potentially enabling broader adoption of interactive optimization techniques without repeated online trajectory generation.

major comments (2)

[Abstract] The claim of empirical parity with multi-turn RL baselines is asserted without any reported experiment details, variance estimates, effective sample size, or analysis of importance-weight variance, leaving the central empirical result unverifiable from the provided text.
[Theoretical Framework / Experiments] Theoretical and empirical sections: The method relies on the standard equivalence between KL-regularized RL and importance-weighted learning, but provides no analysis or metrics on reference-policy coverage of multi-turn state-action distributions, weight variance, or effective sample size. These quantities are load-bearing for the claim that fixed-reference trajectories plus return-based weights suffice to approximate the value function without online correction or behavioral collapse.
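
The effective sample size the report asks for can be computed directly from the trajectory weights. The sketch below uses Kish's formula, a standard diagnostic for importance weights; the function name and the toy weight vectors are hypothetical and are not results from the paper.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2).

    Equals len(weights) for uniform weights and approaches 1 when one weight dominates.
    """
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    if s2 == 0:
        raise ValueError("at least one weight must be non-zero")
    return s1 * s1 / s2

uniform = [1.0] * 8              # every trajectory counts equally
peaked = [100.0] + [1.0] * 7     # one trajectory dominates the weighted loss

ess_uniform = effective_sample_size(uniform)
ess_peaked = effective_sample_size(peaked)
```

In the peaked case the eight trajectories behave like roughly one sample, which is the failure mode the referee is probing: weighted SFT would then amount to imitating a single rollout per prompt.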

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of empirical results and supporting analyses. We will revise the manuscript accordingly to improve verifiability while preserving the core contributions.

Referee: [Abstract] The claim of empirical parity with multi-turn RL baselines is asserted without any reported experiment details, variance estimates, effective sample size, or analysis of importance-weight variance, leaving the central empirical result unverifiable from the provided text.

Authors: The abstract is space-constrained and therefore omits granular statistics. The full manuscript (Section 4 and Appendix) reports results over multiple random seeds with means and standard deviations, along with the experimental protocol. We will revise the abstract to briefly reference these details and will add a dedicated paragraph plus table summarizing importance-weight variance and effective sample size in the experiments section.

revision: yes

Referee: [Theoretical Framework / Experiments] Theoretical and empirical sections: The method relies on the standard equivalence between KL-regularized RL and importance-weighted learning, but provides no analysis or metrics on reference-policy coverage of multi-turn state-action distributions, weight variance, or effective sample size. These quantities are load-bearing for the claim that fixed-reference trajectories plus return-based weights suffice to approximate the value function without online correction or behavioral collapse.

Authors: We agree that explicit metrics on coverage, weight variance, and effective sample size would strengthen the empirical grounding. The equivalence itself is standard (as cited), and the reference policy is the initial SFT model, which by construction covers the training distribution. In revision we will add quantitative analysis of these quantities, including effective sample size calculations and weight histograms, to the experiments section and a new appendix.

revision: yes

Circularity Check

0 steps flagged

No circularity; derivation rests on standard external equivalence between KL-regularized RL and importance-weighted SFT

The paper invokes the known equivalence of the KL-regularized RL objective to importance-weighted supervised learning as a theoretical foundation, then decouples rollout (fixed reference policy) from optimization (weighted SFT). This equivalence is not derived or reduced within the paper's own equations; it is treated as an established insight that the method operationalizes. No parameters are fitted to a subset and then renamed as predictions, no self-citation chains bear the central claim, and the empirical performance comparison is presented as validation rather than a definitional consequence. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Ledger extracted from abstract only; no free parameters or invented entities named.

axioms (1)

domain assumption: the KL-regularized RL objective is equivalent to importance-weighted supervised learning
Invoked as the theoretical foundation for the method.
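
The axiom can be checked numerically on a toy problem. The sketch below assumes a three-action categorical policy, for which weighted maximum likelihood has a closed form (normalized weighted counts); the probabilities, rewards, and `beta` are hypothetical values chosen for illustration.

```python
import math
import random

def tilted_policy(pi_ref, rewards, beta):
    """Closed-form KL-regularized optimum: pi_ref(a) * exp(r(a) / beta), normalized."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def weighted_mle(samples, rewards, beta, num_actions):
    """Weighted maximum likelihood for a categorical policy: normalized weighted counts."""
    mass = [0.0] * num_actions
    for a in samples:
        mass[a] += math.exp(rewards[a] / beta)  # importance weight of this sample
    total = sum(mass)
    return [m / total for m in mass]

rng = random.Random(0)            # fixed seed so the check is repeatable
pi_ref = [0.6, 0.3, 0.1]          # fixed reference policy over three actions
rewards = [0.0, 0.5, 1.0]         # toy returns per action
beta = 0.5

# Sample once from the reference policy, then fit using only the stored samples.
samples = rng.choices(range(3), weights=pi_ref, k=200_000)
target = tilted_policy(pi_ref, rewards, beta)
estimate = weighted_mle(samples, rewards, beta, 3)
max_gap = max(abs(t - e) for t, e in zip(target, estimate))
```

With enough reference samples the weighted fit lands close to the tilted policy. The agreement degrades as the reference policy puts less mass on high-reward actions, which is why coverage matters for the premise above.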

pith-pipeline@v0.9.1-grok · 5734 in / 991 out tokens · 16196 ms · 2026-06-28T23:14:48.707999+00:00 · methodology

Reference graph

Works this paper leans on

28 extracted references · 24 canonical work pages · 16 internal anchors

[1] Bai, C., Zhang, Y., Qiu, S., Zhang, Q., Xu, K., and Li, X. Online preference alignment for language models via count-based exploration. arXiv preprint arXiv:2501.12735.

[2] Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. TheoremQA: A theorem-driven question answering dataset. arXiv preprint arXiv:2305.12524.

[3] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[4] URL https://arxiv.org/abs/2502.11026. Gao, Z., Zhan, W., Chang, J. D., Swamy, G., Brantley, K., Lee, J. D., and Sun, W. Regressing the relative future: Efficient policy optimization for multi-turn RLHF. arXiv preprint arXiv:2410.04612.

[5] Gema, A. P., Leang, J. O. J., Hong, G., Devoto, A., Mancino, A. C. M., Saxena, R., He, X., Zhao, Y., Du, X., Madani, M. R. G., et al. Are we done with MMLU? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5069–5096, 2025.

[6] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[7] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

[8] Kumar, A., Zhuang, V., Agarwal, R., Su, Y., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917.

[9] Laban, P., Hayashi, H., Zhou, Y., and Neville, J. LLMs get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120.

[10] Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.

[11] Li, L., Chen, Z., Chen, G., Zhang, Y., Su, Y., Xing, E., and Zhang, K. Confidence matters: Revisiting intrinsic self-correction capabilities of large language models. arXiv preprint arXiv:2402.12563.

[12] Li, Y., Shen, X., Yao, X., Ding, X., Miao, Y., Krishnan, R., and Padman, R. Beyond single-turn: A survey on multi-turn interactions with large language models. arXiv preprint arXiv:2504.04717.

[13] Liu, L., Wang, Z., Li, L., Xu, C., Lu, Y., Liu, H., Sil, A., and Li, M. A simple "try again" can elicit multi-turn LLM reasoning. arXiv preprint arXiv:2507.14295.

[14] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651.

[15] Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359.

[16] Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.

[17] Qin, C. and Springenberg, J. T. Supervised fine tuning on curated data is reinforcement learning (and can be improved). arXiv preprint arXiv:2507.12856.

[18] Qwen2.5 Technical Report. URL https://arxiv.org/abs/2412.15115. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

[19] Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

[20] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[21] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[22] Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu, P. J., Harrison, J., Lee, J., Xu, K., et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

[23] Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and Choi, Y. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053.

[24] Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245.

[25] Zhang, X., Zeng, S., Li, J., Lin, K., and Hong, M. LLM alignment through successive policy re-weighting (SPR). In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024.

[26] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., and Lin, J. Group sequence policy optimization, 2025a. URL https://arxiv.org/abs/2507.18071. Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arX...
[27] and importance-weighted SFT (Qin & Springenberg, 2025), and successive policy reweighting schemes that target RL objectives with SFT-like compute (Zhang et al., 2024). Complementary analyses study when RLVR with answer-only rewards can still promote correct reasoning and what normalization or supervision improves stability (Wen et al., 2025). Although onl...

[28] as the primary math benchmark with competition-style problems that require multi-step derivations. We also report MATH500 (Hendrycks et al., 2021), a 500-problem evaluation
Table 3. Additional model results on Qwen2.5-7B-Instruct after training on MetaMat...

[1] [1]

On- line preference alignment for language models via count- based exploration.arXiv preprint arXiv:2501.12735,

Bai, C., Zhang, Y ., Qiu, S., Zhang, Q., Xu, K., and Li, X. On- line preference alignment for language models via count- based exploration.arXiv preprint arXiv:2501.12735,

work page arXiv

[2] [2]

Theoremqa: A theorem- driven question answering dataset.arXiv preprint arXiv:2305.12524,

Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y ., Ma, X., Xu, J., Wang, X., and Xia, T. Theoremqa: A theorem- driven question answering dataset.arXiv preprint arXiv:2305.12524,

work page arXiv

[3] [3]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Gao, Z., Zhan, W., Chang, J

URLhttps: //arxiv.org/abs/2502.11026. Gao, Z., Zhan, W., Chang, J. D., Swamy, G., Brantley, K., Lee, J. D., and Sun, W. Regressing the relative future: Efficient policy optimization for multi-turn rlhf.arXiv preprint arXiv:2410.04612,

work page arXiv

[5] [5]

P., Leang, J

Gema, A. P., Leang, J. O. J., Hong, G., Devoto, A., Mancino, A. C. M., Saxena, R., He, X., Zhao, Y ., Du, X., Madani, M. R. G., et al. Are we done with mmlu? InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 5069–5096,

2025

[6] [6]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Training Language Models to Self-Correct via Reinforcement Learning

Kumar, A., Zhuang, V ., Agarwal, R., Su, Y ., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., et al. Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

LLMs Get Lost In Multi-Turn Conversation

Laban, P., Hayashi, H., Zhou, Y ., and Neville, J. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Levine, S. Reinforcement learning and control as proba- bilistic inference: Tutorial and review.arXiv preprint arXiv:1805.00909,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Confidence matters: Revisiting intrinsic self- correction capabilities of large language models.arXiv preprint arXiv:2402.12563,

Li, L., Chen, Z., Chen, G., Zhang, Y ., Su, Y ., Xing, E., and Zhang, K. Confidence matters: Revisiting intrinsic self- correction capabilities of large language models.arXiv preprint arXiv:2402.12563,

work page arXiv

[12] [12]

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Li, Y ., Shen, X., Yao, X., Ding, X., Miao, Y ., Krishnan, R., and Padman, R. Beyond single-turn: A survey on multi-turn interactions with large language models.arXiv preprint arXiv:2504.04717,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

A simple” try again” can elicit multi-turn llm reasoning.arXiv preprint arXiv:2507.14295,

Liu, L., Wang, Z., Li, L., Xu, C., Lu, Y ., Liu, H., Sil, A., and Li, M. A simple” try again” can elicit multi-turn llm reasoning.arXiv preprint arXiv:2507.14295,

work page arXiv

[14] [14]

Self-Refine: Iterative Refinement with Self-Feedback

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y ., et al. Self-refine: Iterative refinement with self- feedback, 2023.URL https://arxiv. org/abs/2303.17651,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Nair, A., Gupta, A., Dalal, M., and Levine, S. Awac: Accel- erating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[16] [16]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177,

work page internal anchor Pith review Pith/arXiv arXiv 1910

[17] [17]

and Springenberg, J

Qin, C. and Springenberg, J. T. Supervised fine tuning on curated data is reinforcement learning (and can be improved).arXiv preprint arXiv:2507.12856,

[18]

Qwen2.5 Technical Report

URL https://arxiv.org/abs/2412.15115.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741,

[19]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022,

[20]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[21]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[22]

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu, P. J., Harrison, J., Lee, J., Xu, K., et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585,

[23]

Generating Sequences by Learning to Self-Correct

Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and Choi, Y. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053,

[24]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Wen, X., Liu, Z., Zheng, S., Ye, S., Wu, Z., Wang, Y., Xu, Z., Liang, X., Li, J., Miao, Z., et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245,

[25]

LLM Alignment through Successive Policy Re-weighting (SPR)

Zhang, X., Zeng, S., Li, J., Lin, K., and Hong, M. LLM alignment through successive policy re-weighting (SPR). In NeurIPS 2024 Workshop on Fine-Tuning in Modern Machine Learning: Principles and Scalability, 2024.

[26]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., and Lin, J. Group sequence policy optimization, 2025a. URL https://arxiv.org/abs/2507.18071. Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arX...

[27]

Complementary analyses study when RLVR with answer-only rewards can still promote correct reasoning and what normalization or supervision improves stability (Wen et al., 2025)

and importance-weighted SFT (Qin & Springenberg, 2025), and successive policy reweighting schemes that target RL objectives with SFT-like compute (Zhang et al., 2024). Complementary analyses study when RLVR with answer-only rewards can still promote correct reasoning and what normalization or supervision improves stability (Wen et al., 2025). Although onl...

[28]

as the primary math benchmark with competition-style problems that require multi-step derivations. We also report MATH500 (Hendrycks et al., 2021), a 500-problem evaluation

Table 3. Additional model results on Qwen2.5-7B-Instruct after training on MetaMat...
