RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

Clara Mohri; David Alvarez-Melis; Rachit Bansal; Sham Kakade; Tian Qin

arxiv: 2606.04272 · v1 · pith:67G7AXIDnew · submitted 2026-06-02 · 💻 cs.LG

Pith reviewed 2026-06-28 10:35 UTC · model grok-4.3

keywords: reinforcement learning · LLM pre-training · supervised fine-tuning · policy optimization · training pipeline · objective merging

The pith

Applying RL early in LLM pre-training often matches the performance of the standard SFT-then-RL pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper challenges the standard practice of applying reinforcement learning only after pre-training and supervised fine-tuning by testing these methods on intermediate checkpoints during pre-training. It finds that RL works effectively very early and can match the full pipeline's performance soon after. Experiments reveal that carefully choosing pre-training data improves RL results more than making the model larger. Combining RL and SFT by averaging their objectives in parallel yields better performance than other methods while keeping the model's general abilities stable. The results point to potential benefits from using RL more throughout the training process.

Core claim

The authors report four findings: RL applied to intermediate pre-training checkpoints is effective early on and often matches the SFT→RL pipeline; pre-training data composition is a stronger factor for RL success than model scale; RL applied directly to base checkpoints broadens the model's output distribution, while the sharpening effect appears only when RL follows SFT; and averaging RL and SFT objectives in parallel outperforms the other tested training methods across metrics without harming general capabilities.

What carries the argument

Direct application of RL to pre-training checkpoints combined with parallel averaging of RL and SFT objectives.
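
As a rough illustration of the parallel-averaging idea, the sketch below follows the description in the Figure 8 caption: one optimizer update is computed from an RL gradient and one from an SFT gradient, each with its own optimizer state, and their average is applied to the weights. Everything else is an assumption made for illustration, not the paper's implementation: the `MomentumSGD` class, the function names, and the two toy quadratic objectives standing in for the RL and SFT losses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Vector = List[float]


@dataclass
class MomentumSGD:
    """Tiny SGD-with-momentum optimizer that owns its own velocity buffer.

    The optimizer type is an assumption; the source only says each
    objective keeps its own optimizer state.
    """
    lr: float = 0.1
    momentum: float = 0.9
    velocity: Vector = field(default_factory=list)

    def proposed_delta(self, grad: Vector) -> Vector:
        """Update the internal state and return the weight change for `grad`."""
        if not self.velocity:
            self.velocity = [0.0] * len(grad)
        self.velocity = [self.momentum * v + g for v, g in zip(self.velocity, grad)]
        return [-self.lr * v for v in self.velocity]


def parallel_average_step(
    params: Vector,
    rl_grad_fn: Callable[[Vector], Vector],
    sft_grad_fn: Callable[[Vector], Vector],
    rl_opt: MomentumSGD,
    sft_opt: MomentumSGD,
) -> Vector:
    """Apply the average of one RL update and one SFT update to `params`."""
    # Both gradients are evaluated at the same current weights.
    rl_delta = rl_opt.proposed_delta(rl_grad_fn(params))
    sft_delta = sft_opt.proposed_delta(sft_grad_fn(params))
    return [p + 0.5 * (a + b) for p, a, b in zip(params, rl_delta, sft_delta)]


# Hypothetical stand-ins for the two objectives: each is the gradient of a
# quadratic loss 0.5 * ||params - target||^2 with a different target.
def toy_rl_grad(params: Vector) -> Vector:
    return [params[0] - 2.0, params[1] - 0.0]


def toy_sft_grad(params: Vector) -> Vector:
    return [params[0] - 0.0, params[1] - 2.0]


def run(steps: int = 500) -> Vector:
    """Run the averaged update from the origin for a fixed number of steps."""
    params = [0.0, 0.0]
    rl_opt, sft_opt = MomentumSGD(), MomentumSGD()
    for _ in range(steps):
        params = parallel_average_step(params, toy_rl_grad, toy_sft_grad, rl_opt, sft_opt)
    return params


if __name__ == "__main__":
    print(run())
```

In this toy setting the weights settle between the two targets, which is the minimizer of the averaged objective. With a nonlinear optimizer such as Adam the averaged update would generally not equal a single update on the averaged gradient, so keeping separate optimizer states is a real design choice rather than a bookkeeping detail.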

If this is right

RL can achieve strong results without a separate SFT stage first.
Optimizing pre-training data for RL can be more impactful than scaling up models.
Early RL preserves general model capabilities better than SFT.
Merging objectives allows combining strengths of RL and SFT in one process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interleaving RL steps during pre-training could lead to more efficient overall training schedules.
Models with expanded distributions from early RL might handle diverse tasks better downstream.
Future work could explore varying the frequency of RL applications during pre-training to optimize outcomes.

Load-bearing premise

The tested pre-training checkpoints, problem types, and model sizes capture the general behavior of LLM training across different cases.

What would settle it

Running similar experiments on a different LLM architecture or dataset and observing that RL no longer becomes effective early or fails to match the SFT-RL pipeline.

Figures

Figures reproduced from arXiv: 2606.04272 by Clara Mohri, David Alvarez-Melis, Rachit Bansal, Sham Kakade, Tian Qin.

**Figure 1.** Overview. We compare several post-training recipes applied to intermediate pre-training checkpoints $M_t$: direct RL ($M_t^{\text{RL}}$), SFT with one solution per question ($M_t^{\text{SFT}}$), SFT with multiple solutions ($M_t^{\text{SFT-Gold}}$), the standard pipeline of RL after SFT ($M_t^{\text{SFT}\to\text{RL}}$), and parallel averaging of RL and SFT gradients ($M_t^{\text{Parallel}}$). (1) RL improves both pass@1 and pass@32 on checkpoints trained for as low as 4B pr…

**Figure 2.** RL is effective early in pre-training. GSM8K pass@k for $M_t$, $M_t^{\text{SFT}}$, $M_t^{\text{SFT}\to\text{RL}}$, and $M_t^{\text{RL}}$ across pre-training tokens $t$, with all SFT baselines trained on the SFT set (one ground-truth solution per problem). $M_t^{\text{RL}}$ improves over $M_t$ from as few as 4B tokens. By 10B tokens, $M_t^{\text{RL}}$ matches the standard $M_t^{\text{SFT}\to\text{RL}}$ pipeline and outperforms $M_t^{\text{SFT}}$ alone. Difficulty splits. OpenMathInstruct contains two categories …

**Figure 3.** Diverse SFT data shifts the balance toward SFT-Gold. In contrast with …

**Figure 4.** Targeted pre-training data beats model scale for RL. Improvement on MATH from RL over the base model, across pre-training configurations: (i) 1B-50B, the original pre-trained model; (ii) Scaling D (1B-60B), a 1B model pre-trained from scratch with an additional 10B math-heavy tokens mixed in; (iii) Scaling N (4B-50B), a 4B model trained on the same 50B-token mix as the original 1B model. Adding task-relevant pre-training …

**Figure 5.** Base pass@k on training data predicts RL effectiveness. Base model 8-shot pass@k on the test set (x-axis) vs. after RL (y-axis), for GSM8K (left) and MATH (right). pass@k accuracy on the test set might serve as a lightweight metric for whether RL training will yield downstream gains.

**Figure 6.** Direct RL expands while SFT→RL sharpens. GSM8K pass@1 and pass@32 tracked across training stages on the same pretraining checkpoint $M_t$. Left: under the standard $M_t \to M_t^{\text{SFT}} \to M_t^{\text{SFT}\to\text{RL}}$ pipeline, pass@1 continues to improve during RL but pass@32 decreases, reproducing the sharpening effect reported in prior work. Right: applying RL directly to $M_t$ improves both pass@1 and pass@32, expanding the model’s distr…

**Figure 7.** RL preserves general capabilities while SFT degrades them. Performance on six general-purpose (non-math) benchmarks for the base model $M_t$ and three post-trained variants: $M_t^{\text{RL}}$, $M_t^{\text{SFT}}$, and $M_t^{\text{SFT-Gold}}$. Both SFT and SFT-Gold consistently degrade performance by 4–8 pp on average across the benchmarks, while RL leaves these capabilities essentially unchanged.

**Figure 8.** Parallel averaging combines the strengths of RL and SFT across pre-training. (Left) The parallel-averaging update (Algorithm 1): at each step we take a single optimizer update from each of an RL gradient and an SFT gradient (each with its own optimizer state) and use their average to update the model weights. (Right) Parallel averaging ($M_t^{\text{Parallel}}$) achieves the strongest pass@32 across pre-training check…

**Figure 9.** RL underperforms SFT→RL on harder MATH problems. MATH pass@k for $M_t$, $M_t^{\text{SFT}}$, $M_t^{\text{SFT}\to\text{RL}}$, and $M_t^{\text{RL}}$ trained on the full OpenMathInstruct, with the base model at $N$ = 1B parameters and $D$ = 50B pretraining tokens. $M_t^{\text{RL}}$ still improves over $M_t$ before Chinchilla-optimal token counts, but a persistent gap to $M_t^{\text{SFT}\to\text{RL}}$ remains throughout pretraining, indicating that direct RL is insufficient on harder reasoning t…

**Figure 10.** Adding math pretraining data narrows the MATH gap. Same setup as …

**Figure 11.** Scaling parameters does not close the MATH gap. Same setup as …

**Figure 12.** RL training reaches convergence at all checkpoints. Training reward, validation reward, and GSM8K test reward during RL training for $M_t^{\text{RL}}$ across pretraining checkpoints $t$. All three reward metrics converge by end-of-training, confirming that performance differences between checkpoints are not artifacts of insufficient RL optimization. For checkpoints with seed brittleness ($t$ < 10B), we plot the favorable…

**Figure 13.** Training reward hides RL seed brittleness on early checkpoints. A favorable seed (blue) and an unfavorable seed (red) for $M_t^{\text{RL}}$ at $t$ = 4B tokens. Left: training reward curves are nearly identical between seeds, offering no warning of divergent test outcomes. Middle: validation reward begins to diverge mid-training, and the unfavorable seed only reaches 10%, which comes from the format reward. Right: on GSM8K, the f…

**Figure 14.** SFT converges by 5 epochs. GSM8K accuracy of $M_t^{\text{SFT}}$ after training for different numbers of epochs on OpenMathInstruct. Performance plateaus by 5 epochs, which we use as the standard SFT training length for all $M_t^{\text{SFT}}$ baselines.

**Figure 15.** Base checkpoints peak at 8-shot prompting. Performance of pretraining checkpoints $M_t$ on GSM8K (top) and MATH (bottom) under varying numbers of in-context examples. Across both benchmarks, accuracy is maximized at 8-shot, which we use throughout for $M_t$ evaluation. Model and Method. We conduct all experiments using the OLMo2 1B model (OLMo Team et al., 2025b). We perform GRPO training for both GSM8K-Easy an…

**Figure 16.** Full results for parallel average algorithm. … maximizes the utility of each training example, leading to faster convergence in terms of training steps. Conversely, reducing the rollout count ($n$ = 5) is significantly more FLOP-efficient, achieving comparable performance with a fraction of the compute budget. Finally, this compute advantage is particularly pronounced on the GSM8K-Hard split for the pass@8 me…

**Figure 17.** Fewer rollouts are more FLOP-efficient at convergence. GSM8K pass@k during RL training with $n$ = 5 versus $n$ = 64 rollouts per prompt, on training sets sub-sampled to be relatively easy or hard for the base model (a proxy for late vs. early pretraining). Asymptotic performance is similar across rollout counts. However, $n$ = 5 achieves comparable accuracy at substantially lower FLOPs, especially on the harder…
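
The captions above report pass@1, pass@8, and pass@32 throughout. This page does not state how the paper computes these numbers, so the sketch below is an assumption: it uses the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for a problem with `n` sampled solutions of which `c` are correct. The function names and the sample counts in the usage example are hypothetical and chosen only to show how pass@1 and pass@k can move in opposite directions.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up counts over two problems with n = 4 samples each.
    concentrated = [(4, 4), (4, 0)]  # always solves one problem, never the other
    spread_out = [(4, 1), (4, 1)]    # occasionally solves both problems
    for name, results in [("concentrated", concentrated), ("spread_out", spread_out)]:
        print(name, mean_pass_at_k(results, 1), mean_pass_at_k(results, 4))
```

In the made-up example, the concentrated model has the higher pass@1 while the spread-out model has the higher pass@4. That is the kind of trade-off behind the sharpening versus expansion contrast in Figure 6, where pass@1 and pass@32 are tracked together.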

Original abstract

The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT$\to$RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model's distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms all other training methods discussed across metrics while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Early RL on pre-training checkpoints can match the usual pipeline and objective averaging beats sequential training, but results hinge on untested representativeness of the chosen setups.

The main things to know are that RL applied straight to intermediate pre-training checkpoints often performs as well as the full SFT-then-RL sequence, and that simply averaging the RL and SFT objectives in parallel beats the other schedules they tested while keeping general capabilities intact.

The experiments start training from scratch and then run RL, SFT, or both at different pre-training points. That timing test is the clearest new piece. They also report that targeted pre-training data matters more for RL success than model scale, that RL widens the output distribution while SFT narrows it, and that general capabilities hold steady under RL but drop after SFT. The averaging trick is straightforward and apparently wins across their metrics.

The soft spot is exactly the one the stress-test note flags: the specific checkpoints, tasks, and scales may not be representative. Nothing in the abstract or summary shows comparisons to standard pre-training corpora, other model families, or scaling behavior beyond the points they ran. If those choices happened to favor early RL, the recommendation to expand its use would not carry over. Without error bars, run counts, or controls visible here, the outperformance claims are hard to weigh.

This is for people who design or tweak LLM training pipelines and want concrete alternatives to the current order. It deserves a serious referee because it directly tests a widespread assumption with from-scratch runs rather than just arguing from existing results.

Referee Report

2 major / 1 minor

Summary. The paper trains LLMs from scratch and applies RL, SFT, and SFT-then-RL to intermediate pre-training checkpoints. It reports that RL becomes effective very early and often matches the full SFT$\to$RL pipeline; that targeted pre-training data composition is a stronger lever for RL success than model scale; that RL on base checkpoints expands the output distribution, while sharpening occurs only when RL follows SFT; that general capabilities remain essentially unchanged under RL but degrade under SFT; and that parallel averaging of RL and SFT objectives outperforms all other tested regimes across metrics while preserving general capabilities. The authors conclude that LLM training might benefit from expanded use of RL.

Significance. If the empirical patterns hold under broader conditions, the work would be significant for questioning the standard post-SFT RL stage and for identifying data composition and objective merging as high-leverage levers. The distinction between distribution expansion (RL) and sharpening (SFT), plus the preservation of general capabilities, would be useful observations for training design.

major comments (2)

[Experiments (implicit in abstract claims)] The central recommendation to expand RL use rests on the tested intermediate checkpoints, tasks, and scales being representative of typical LLM pipelines. No comparisons to standard pre-training corpora, no architecture-family variation, and no scaling curves beyond the reported points are described, which directly affects whether the headline claims generalize.
[Results on data composition] The claim that 'targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale' is load-bearing for the data-composition emphasis, yet the abstract provides no quantitative comparison (e.g., effect sizes or ablation tables) that would allow readers to verify the relative importance.

minor comments (1)

[Abstract] The abstract states that 'the general capabilities of the model remain essentially unchanged by RL' without specifying the evaluation suite or controls for length or format effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.

Referee: [Experiments (implicit in abstract claims)] The central recommendation to expand RL use rests on the tested intermediate checkpoints, tasks, and scales being representative of typical LLM pipelines. No comparisons to standard pre-training corpora, no architecture-family variation, and no scaling curves beyond the reported points are described, which directly affects whether the headline claims generalize.

Authors: We agree that the experiments are confined to specific intermediate checkpoints, tasks, and scales, without direct comparisons against standard pre-training corpora, architecture-family variations, or extended scaling curves. This constrains the generalizability of the headline claims. We will add a dedicated Limitations section that explicitly discusses these scope restrictions and calls for future validation on broader corpora and architectures. The controlled setting nonetheless isolates the effects of RL timing and data composition, providing evidence that challenges the conventional post-SFT RL stage. revision: yes
Referee: [Results on data composition] The claim that 'targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale' is load-bearing for the data-composition emphasis, yet the abstract provides no quantitative comparison (e.g., effect sizes or ablation tables) that would allow readers to verify the relative importance.

Authors: The manuscript body contains the requested quantitative comparisons, including ablation tables and effect-size measurements that contrast data-composition changes against scale increases (Tables 2–3 and associated figures). The abstract, as a high-level summary, omits these specifics. We will revise the abstract to include a concise quantitative statement on the relative effect sizes, enabling readers to assess the claim directly from the summary. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivation chain

The paper contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains that reduce claims to inputs by construction. All results are direct empirical observations from training runs on specific checkpoints, tasks, and scales. No load-bearing step invokes a uniqueness theorem, ansatz smuggled via citation, or renaming of known results as new organization. The central findings (early RL effectiveness, data composition effects, merging objectives) rest on reported experimental outcomes rather than any self-referential reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical model, derivations, or new entities; all content is empirical.

pith-pipeline@v0.9.1-grok · 5721 in / 946 out tokens · 29436 ms · 2026-06-28T10:35:41.608754+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 14 canonical work pages · 7 internal anchors

[1] Reinforcement Learning on Pre-Training Data. arXiv preprint arXiv:2509.19249.
[2] Hatamizadeh, Ali; Akter, Syeda Nahida; Prabhumoye, Shrimai; Kautz, Jan; Patwary, Mostofa; Shoeybi, Mohammad; Catanzaro, Bryan; Choi, Yejin. 2026.
[3] Reinforcement Pre-Training. arXiv preprint arXiv:2506.08007.
[4] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 2022.
[5] Dai, Josef; Pan, Xuehai; Sun, Ruiyang; Ji, Jiaming; Xu, Xinbo; Liu, Mickel; Wang, Yizhou; Yang, Yaodong. Safe … 2024.
[6] Trust Region Policy Optimization. International Conference on Machine Learning (ICML), 2015.
[7] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Advances in Neural Information Processing Systems, 2023.
[8] Zheng, Rui; Dou, Shihan; Gao, Songyang; Hua, Yuan; Shen, Wei; Wang, Binghai; Liu, Yan; Jin, Senjie; Liu, Qin; Zhou, Yuhao; et al. Secrets of … 2023.
[9] Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Bi, Xiao; Zhang, Haowei; Zhang, Mingchuan; Li, YK; Wu, Yang; et al. 2024.
[10] Zhang, Charlie; Neubig, Graham; Yue, Xiang. On the Interplay of Pre-Training, Mid-Training, and … 2025.
[11] … T., Krishnamurthy, A., and Foster, D. The Coverage Principle: How Pre-Training Enables Post-Training. arXiv preprint arXiv:2510.15020.
[12] Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration. arXiv preprint arXiv:2503.07453.
[13] Cheng, Zhoujun; Xie, Yutao; Qu, Yuxiao; Setlur, Amrith; Hao, Shibo; Pimpalkhute, Varad; Liang, Tongtong; Yao, Feng; Liu, Hector; Xing, Eric; Smith, Virginia; Salakhutdinov, Ruslan; Hu, Zhiting; Killian, Taylor; Kumar, Aviral. 2026.
[14] Xing, Xingrun; Fan, Zhiyuan; Lou, Jie; Li, Guoqi; Zhang, Jiajun; Zhang, Debing. 2025.
[15] Biderman, Stella; Schoelkopf, Hailey; Anthony, Quentin Gregory; Bradley, Herbie; O'Brien, Kyle; Hallahan, Eric; Khan, Mohammad Aflah; Purohit, Shivanshu; Prashanth, USVSN Sai; Raff, Edward; et al. 2023.
[16] 2. Conference on Language Modeling (COLM).
[17] Toshniwal, Shubham; Moshkov, Ivan; Narenthiran, Sean; Gitman, Daria; Jia, Fei; Gitman, Igor. 2024.
[18] Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir Yitzhak; Bansal, Hritik; Guha, Etash; Keh, Sedrick Scott; Arora, Kushal; et al. 2024.
[19] Finetuned Language Models Are Zero-Shot Learners. International Conference on Learning Representations (ICLR).
[20] Su, Jianlin; Lu, Yu; Pan, Shengfeng; Wen, Bo; Liu, Yunfeng. 2021.
[21] Shazeer, Noam. 2020.
[22] Attention Is All You Need. Advances in Neural Information Processing Systems, 2017.
[23] Decoupled Weight Decay Regularization. 2019.
[24] Sheng, Guangming; Zhang, Chi; Ye, Zilingfeng; Wu, Xibin; Zhang, Wang; Zhang, Ru; Peng, Yanghua; Lin, Haibin; Wu, Chuan. 2024.
[25] Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
[26] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
[27] Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob. Measuring Mathematical Problem Solving with the … 2021.
[28] Yue, Yang; Chen, Zhiqi; Lu, Rui; Zhao, Andrew; Wang, Zhaokai; Yue, Yang; Song, Shiji; Huang, Gao. Does Reinforcement Learning Really Incentivize Reasoning Capacity in … 2025.
[29] Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.
[30] Wu, Fang; Xuan, Weihao; Lu, Ximing; Liu, Mingjie; Dong, Yi; Harchaoui, Zaid; Choi, Yejin. The Invisible Leash: Why … 2025.
[31] Zhou, Chunting; Liu, Pengfei; Xu, Puxin; Iyer, Srinivasan; Sun, Jiao; Mao, Yuning; Ma, Xuezhe; Efrat, Avia; Yu, Ping; Yu, Lili; et al. 2023.
[32] Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong; Wang, Peiyi; Bi, Xiao; et al. 2025.
[33] arXiv preprint arXiv:2512.13961.
[34] Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945.
[35] Dong, Yihong; Jiang, Xue; Tao, Yongding; Liu, Huanyu; Zhang, Kechi; Mou, Lili; Cao, Rongyu; Ma, Yingwei; Chen, Jue; Li, Binhua; et al. 2026.
[36] Fu, Yuqian; Chen, Tinghong; Chai, Jiajun; Wang, Xihuai; Tu, Songjun; Yin, Guojun; Lin, Wei; Zhang, Qichao; Zhu, Yuanheng; Zhao, Dongbin. 2025.
[37] Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling. International Conference on Machine Learning (ICML).
[38] Towards a Unified View of Large Language Model Post-Training. arXiv preprint arXiv:2509.04419.
[39] Liu, Mingyang; Farina, Gabriele; Ozdaglar, Asuman. 2025.
[40] Zhang, Wenhao; Xie, Yuexiang; Sun, Yuchang; Chen, Yanxi; Wang, Guoyin; Li, Yaliang; Ding, Bolin; Zhou, Jingren. On-Policy … arXiv:2508.11408.
[41] Reasoning with Sampling: Your Base Model Is Smarter Than You Think. arXiv preprint arXiv:2510.14901.
[42] Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Dai, Weinan; Fan, Tiantian; Liu, Gaohong; Liu, Lingjun; et al. 2025.
[43] Chu, Tianzhe; Zhai, Yuexiang; Yang, Jihan; Tong, Shengbang; Xie, Saining; Schuurmans, Dale; Le, Quoc V; Levine, Sergey; Ma, Yi. 2025.
[44] Explaining Neural Scaling Laws. Proceedings of the National Academy of Sciences, 2024.
[45] Limozin, Alexis; Durech, Eduard; Hoefler, Torsten; Schlag, Imanol; Pyatkin, Valentina. 2026.
[46] Jaech, Aaron; Kalai, Adam; Lerer, Adam; Richardson, Adam; El-Kishky, Ahmed; Low, Aiden; Helyar, Alec; Madry, Aleksander; Beutel, Alex; Carney, Alex; et al. 2024.
[47] Shenfeld, Idan; Pari, Jyothish; Agrawal, Pulkit. 2025.
[48] Khatri, Devvrit; Madaan, Lovish; Tiwari, Rishabh; Bansal, Rachit; Duvvuri, Sai Surya; Zaheer, Manzil; Dhillon, Inderjit S.; Brandfonbrener, David; Agarwal, Rishabh. The Art of Scaling Reinforcement Learning Compute for LLMs. arXiv:2510.13786.
[49] Decomposing Elements of Problem Solving: What "Math" Does RL Teach? 2025.
[50] Random Scaling of Emergent Capabilities. 2026.
[51] Qin, Tian; Saphra, Naomi; Alvarez-Melis, David. Sometimes I am a tree: Data drives unstable hierarchical generalization. arXiv:2412.04619 [cs.LG].

[1] [1]

arXiv preprint arXiv:2509.19249 , year =

Reinforcement Learning on Pre-Training Data , author =. arXiv preprint arXiv:2509.19249 , year =

work page arXiv

[2] [2]

2026 , url =

Hatamizadeh, Ali and Akter, Syeda Nahida and Prabhumoye, Shrimai and Kautz, Jan and Patwary, Mostofa and Shoeybi, Mohammad and Catanzaro, Bryan and Choi, Yejin , booktitle =. 2026 , url =

2026

[3] [3]

arXiv preprint arXiv:2506.08007 , year =

Reinforcement Pre-Training , author =. arXiv preprint arXiv:2506.08007 , year =

work page arXiv

[4] [4]

Advances in Neural Information Processing Systems , volume =

Training Language Models to Follow Instructions with Human Feedback , author =. Advances in Neural Information Processing Systems , volume =. 2022 , url =

2022

[5] [5]

Dai, Josef and Pan, Xuehai and Sun, Ruiyang and Ji, Jiaming and Xu, Xinbo and Liu, Mickel and Wang, Yizhou and Yang, Yaodong , booktitle =. Safe. 2024 , url =

2024

[6] [6]

International Conference on Machine Learning (ICML) , pages =

Trust Region Policy Optimization , author =. International Conference on Machine Learning (ICML) , pages =. 2015 , organization =

2015

[7] [7]

Advances in Neural Information Processing Systems , volume =

Direct Preference Optimization: Your Language Model Is Secretly a Reward Model , author =. Advances in Neural Information Processing Systems , volume =. 2023 , url =

2023

[8] [8]

Secrets of

Zheng, Rui and Dou, Shihan and Gao, Songyang and Hua, Yuan and Shen, Wei and Wang, Binghai and Liu, Yan and Jin, Senjie and Liu, Qin and Zhou, Yuhao and others , journal =. Secrets of. 2023 , url =

2023

[9] [9]

2024 , url =

Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Yang and others , journal =. 2024 , url =

2024

[10] [10]

On the Interplay of Pre-Training, Mid-Training, and

Zhang, Charlie and Neubig, Graham and Yue, Xiang , journal =. On the Interplay of Pre-Training, Mid-Training, and. 2025 , url =

2025

[11] [11]

T., Krishnamurthy, A., and Foster, D

The Coverage Principle: How Pre-Training Enables Post-Training , author =. arXiv preprint arXiv:2510.15020 , year =

work page arXiv

[12] [12]

arXiv preprint arXiv:2503.07453 , year =

Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration , author =. arXiv preprint arXiv:2503.07453 , year =

work page arXiv

[13] [13]

2026 , url =

Cheng, Zhoujun and Xie, Yutao and Qu, Yuxiao and Setlur, Amrith and Hao, Shibo and Pimpalkhute, Varad and Liang, Tongtong and Yao, Feng and Liu, Hector and Xing, Eric and Smith, Virginia and Salakhutdinov, Ruslan and Hu, Zhiting and Killian, Taylor and Kumar, Aviral , howpublished =. 2026 , url =

2026

[14] [14]

2025 , url =

Xing, Xingrun and Fan, Zhiyuan and Lou, Jie and Li, Guoqi and Zhang, Jiajun and Zhang, Debing , journal =. 2025 , url =

2025

[15] Biderman, Stella; Schoelkopf, Hailey; Anthony, Quentin Gregory; Bradley, Herbie; O'Brien, Kyle; Hallahan, Eric; Khan, Mohammad Aflah; Purohit, Shivanshu; Prashanth, USVSN Sai; Raff, Edward; et al. 2023.

[16] [...] 2. Conference on Language Modeling (COLM).

[17] Toshniwal, Shubham; Moshkov, Ivan; Narenthiran, Sean; Gitman, Daria; Jia, Fei; Gitman, Igor. 2024.

[18] Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir Yitzhak; Bansal, Hritik; Guha, Etash; Keh, Sedrick Scott; Arora, Kushal; et al. 2024.

[19] Finetuned Language Models Are Zero-Shot Learners. International Conference on Learning Representations (ICLR).

[20] Su, Jianlin; Lu, Yu; Pan, Shengfeng; Wen, Bo; Liu, Yunfeng. 2021.

[21] Shazeer, Noam. 2020.

[22] Attention Is All You Need. Advances in Neural Information Processing Systems, 2017.

[23] Decoupled Weight Decay Regularization. 2019.

[24] Sheng, Guangming; Zhang, Chi; Ye, Zilingfeng; Wu, Xibin; Zhang, Wang; Zhang, Ru; Peng, Yanghua; Lin, Haibin; Wu, Chuan. 2024.

[25] Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.

[26] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[27] Hendrycks, Dan; Burns, Collin; Kadavath, Saurav; Arora, Akul; Basart, Steven; Tang, Eric; Song, Dawn; Steinhardt, Jacob. Measuring Mathematical Problem Solving with the [...]. 2021.

[28] Yue, Yang; Chen, Zhiqi; Lu, Rui; Zhao, Andrew; Wang, Zhaokai; Yue, Yang; Song, Shiji; Huang, Gao. Does Reinforcement Learning Really Incentivize Reasoning Capacity in [...]. 2025.

[29] Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.

[30] Wu, Fang; Xuan, Weihao; Lu, Ximing; Liu, Mingjie; Dong, Yi; Harchaoui, Zaid; Choi, Yejin. The Invisible Leash: Why [...]. 2025.

[31] Zhou, Chunting; Liu, Pengfei; Xu, Puxin; Iyer, Srinivasan; Sun, Jiao; Mao, Yuning; Ma, Xuezhe; Efrat, Avia; Yu, Ping; Yu, Lili; et al. 2023.

[32] Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; Xu, Runxin; Zhu, Qihao; Ma, Shirong; Wang, Peiyi; Bi, Xiao; et al. 2025.

[33] arXiv preprint arXiv:2512.13961.

[34] Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945.

[35] Dong, Yihong; Jiang, Xue; Tao, Yongding; Liu, Huanyu; Zhang, Kechi; Mou, Lili; Cao, Rongyu; Ma, Yingwei; Chen, Jue; Li, Binhua; et al. 2026.

[36] Fu, Yuqian; Chen, Tinghong; Chai, Jiajun; Wang, Xihuai; Tu, Songjun; Yin, Guojun; Lin, Wei; Zhang, Qichao; Zhu, Yuanheng; Zhao, Dongbin. 2025.

[37] Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling. International Conference on Machine Learning (ICML).

[38] Towards a Unified View of Large Language Model Post-Training. arXiv preprint arXiv:2509.04419.

[39] Liu, Mingyang; Farina, Gabriele; Ozdaglar, Asuman. 2025.

[40] Zhang, Wenhao; Xie, Yuexiang; Sun, Yuchang; Chen, Yanxi; Wang, Guoyin; Li, Yaliang; Ding, Bolin; Zhou, Jingren. On-Policy [...]. arXiv:2508.11408.

[41] Reasoning with Sampling: Your Base Model Is Smarter Than You Think. arXiv preprint arXiv:2510.14901.

[42] Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Dai, Weinan; Fan, Tiantian; Liu, Gaohong; Liu, Lingjun; et al. 2025.

[43] Chu, Tianzhe; Zhai, Yuexiang; Yang, Jihan; Tong, Shengbang; Xie, Saining; Schuurmans, Dale; Le, Quoc V; Levine, Sergey; Ma, Yi. 2025.

[44] Explaining Neural Scaling Laws. Proceedings of the National Academy of Sciences, 2024.

[45] Limozin, Alexis; Durech, Eduard; Hoefler, Torsten; Schlag, Imanol; Pyatkin, Valentina. 2026.

[46] Jaech, Aaron; Kalai, Adam; Lerer, Adam; Richardson, Adam; El-Kishky, Ahmed; Low, Aiden; Helyar, Alec; Madry, Aleksander; Beutel, Alex; Carney, Alex; et al. 2024.

[47] Shenfeld, Idan; Pari, Jyothish; Agrawal, Pulkit. 2025.

[48] Khatri, Devvrit; Madaan, Lovish; Tiwari, Rishabh; Bansal, Rachit; Duvvuri, Sai Surya; Zaheer, Manzil; Dhillon, Inderjit S.; Brandfonbrener, David; Agarwal, Rishabh. The Art of Scaling Reinforcement Learning Compute for LLMs. arXiv:2510.13786.

[49] Decomposing Elements of Problem Solving: What "Math" Does RL Teach? 2025.

[50] Random Scaling of Emergent Capabilities. 2026.

[51] Qin, Tian; Saphra, Naomi; Alvarez-Melis, David. Sometimes I am a tree: Data drives unstable hierarchical generalization. arXiv:2412.04619 [cs.LG].