SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Kelly Liu; Siddharth Aphale

arxiv: 2606.18487 · v2 · pith:X6RSJVKXnew · submitted 2026-06-16 · 💻 cs.LG · cs.AI· cs.CL

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Siddharth Aphale , Kelly Liu This is my paper

Pith reviewed 2026-06-27 01:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords SFT overtrainingentropy collapseGRPOrank inversiongroup advantage varianceRLVRpass@1 heuristicrollout distribution

0 comments

The pith

Overtrained SFT checkpoints can invert GRPO performance rankings via entropy collapse

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the usual rule of choosing the SFT checkpoint with the highest pass@1 before GRPO can produce the opposite of the intended result. When SFT is continued too far it squeezes the rollout distribution, cutting entropy so the success probability p falls below the threshold p*(g) for a given group size. At that point the expected within-group advantage variance p(1-p)(g-1)/g drops near zero because most groups receive identical binary rewards and supply no relative signal. Experiments on Qwen2.5-Coder-3B show pass@1 rising with SFT depth while peak GRPO pass@10 falls from 0.806 to 0.481, with pre-RL entropy correlating at +0.69; the pattern is weaker on DeepSeek-Coder-6.7B where p stays above the threshold. A simple two-stage entropy check before and during early GRPO can detect and halt these failing trajectories.

Core claim

When SFT depth increases, pre-RL pass@1 rises but the rollout entropy falls, driving p below p*(g) and collapsing the within-group advantage variance p(1-p)(g-1)/g; this removes the relative reward signal needed for effective GRPO updates, so that the checkpoint with the highest SFT pass@1 produces the lowest post-GRPO performance on Qwen2.5-Coder-3B while the effect is milder on DeepSeek-Coder-6.7B where p remains above the threshold.

What carries the argument

The formula for expected within-group advantage variance under binary rewards, p(1-p)(g-1)/g, which drops sharply once the success probability p falls below the critical value p*(g) and thereby removes distinguishable advantage signals inside groups.

If this is right

On Qwen2.5-Coder-3B, deeper SFT raises pass@1 yet lowers peak GRPO pass@10 from 0.806 to 0.481 across three seeds.
Pre-RL entropy correlates positively (rho = +0.69) with final GRPO outcome on the Qwen model.
A two-stage diagnostic that combines pre-RL entropy triage with an early GRPO entropy monitor can flag and stop failing runs.
KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint.
On DeepSeek-Coder-6.7B, where pass@1 remains above p*(8) = 0.083, GRPO outcomes compress rather than fully invert.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Entropy measured on the rollout distribution before RL could replace pass@1 as the primary criterion for selecting an SFT starting checkpoint.
The inversion effect appears model-dependent, as the two architectures tested exhibit different thresholds and severity.
Similar variance-collapse conditions may exist for non-binary rewards or other group-relative RL methods beyond GRPO.

Load-bearing premise

The observed drop in GRPO performance with greater SFT depth is produced by entropy collapse lowering group advantage variance rather than by other unmeasured differences between checkpoints or model families.

What would settle it

A controlled experiment in which an SFT checkpoint with low pre-RL entropy and p below p*(g) still reaches high GRPO pass@10, or in which measured within-group advantage variance stays high despite low p, would falsify the claimed causal link.

Figures

Figures reproduced from arXiv: 2606.18487 by Kelly Liu, Siddharth Aphale.

**Figure 1.** Figure 1: Pre RL entropy across the SFT ladder. Mean next token entropy on a 40 problem HumanEval+ probe (T=1.0, n=128); lines = 3 seed mean, bands = per checkpoint min/max. Dashed: Stage 1 threshold τH=0.18 nats. 4.1. Rank Inversion and Rank Compression Under GRPO The inversion is present from the first evaluation at step 50 and holds without exception. The 1.0 epoch policy peaks at pass@10 of 0.824 at step 250, wh… view at source ↗

**Figure 2.** Figure 2: Rank inversion vs. rank compression (n=20, T=1.0, in training pass@10). Lines: 3 seed mean; bands: min/max. (a) Qwen2.5-Coder-3B-Base: curves fan out with SFT depth, post RL ordering inverts pre RL. (b) DeepSeek-Coder-6.7B-Base: four matched three seed depths bunch in [0.80, 0.91]. Pre and post RL ranks coincide exactly (Spearman ρ=+1.00). Y axes differ between panels; absolute levels not comparable across… view at source ↗

**Figure 3.** Figure 3: Entropy collapse and GRPO outcomes. (a) Per step worst token entropy (n=5 probe, log scale, 3 seed mean): Qwen† collapses with SFT depth; DeepSeek‡ band is the per step min/max across four GRPO checkpoints. (b) Pre RL entropy vs. ∆pass@10 (post RL − pre RL deep eval, n=128, 3 seed mean): Qwen inverts (∆<0), DeepSeek compresses (∆≥0). †Qwen-3B; ‡DeepSeek-6.7B. the 5.8 epoch worst token entropy drops from 0.… view at source ↗

**Figure 4.** Figure 4: Post hoc interventions fail to rescue trainability (Qwen2.5-Coder-3B-Base). (a) KL penalty preserves the rank inversion: pass@10 under β=0 (solid) vs. β=0.01 (dashed); both KL curves underperform their baselines. (b) Entropy restoration paradox: 5.8 epoch baseline vs. 5.8 epoch + label smoothing (α=0.1). LS yields the highest pre RL pass@1 of any run (0.600) but pass@10 caps at 0.214. from 0.824 to 0.684 a… view at source ↗

**Figure 5.** Figure 5: Peak outcome and diversity collapse across both models. (a) Peak pass@10 by SFT depth (3 seed mean, min/max bars): Qwen-3B declines monotonically with SFT depth; DeepSeek-6.7B stays flat. (b) Diversity ratio pass@10/pass@1 during GRPO: Qwen† reaches 5.7 to 8.8× for deeper checkpoints (per checkpoint lines), with the 1.0 epoch curve staying < 2.7×; DeepSeek-6.7B‡ stays ≈ 2.1 to 2.4× across all GRPO arms (me… view at source ↗

read the original abstract

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3 seed mean, $n{=}20$); pre RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two stage diagnostic, combining pre RL entropy triage with an early GRPO entropy monitor, flags high risk checkpoints and can stop failing runs early. Simple KL to reference regularisation and label smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SFT depth on Qwen tanks later GRPO via entropy drop, with a clean correlation but only indirect support for the variance mechanism.

read the letter

The main thing to know is that this paper shows excessive SFT on Qwen2.5-Coder-3B raises pass@1 but then cuts GRPO pass@10 from 0.806 down to 0.481 across seeds, tied to falling pre-RL entropy with rho +0.69. DeepSeek avoids the inversion because its pass@1 stays well above the p* threshold where groups lose signal.

The paper does a solid job running matched depth ladders on two models with three seeds each and reporting the concrete performance numbers plus the entropy link. The variance formula from Bernoulli stats is straightforward and parameter-free. The two-stage diagnostic for early stopping is a usable takeaway, and checking that KL regularization and label smoothing do not rescue the run rules out a simple hyperparameter fix.

The soft spot is the causal step. The claim that entropy collapse reduces effective group advantage variance rests on the correlation plus the independence-assuming formula, without a direct measurement of within-group reward variance or sample diversity during the first GRPO steps. The Qwen versus DeepSeek difference also suggests model-specific factors may be at play beyond entropy alone. If the full paper has additional controls or variance traces, that would strengthen it.

This is aimed at practitioners doing SFT-then-GRPO on code models who need to pick checkpoints carefully. A reader in that setting gets a clear warning and a practical monitor. It deserves peer review because the empirical pattern is reproducible and has immediate training implications, even if the mechanism needs tighter evidence.

Referee Report

3 major / 2 minor

Summary. The paper claims that the common heuristic of selecting the SFT checkpoint with highest pass@1 for subsequent GRPO can fail due to SFT-induced entropy collapse in the rollout distribution. For binary rewards this reduces expected within-group advantage variance below the independent Bernoulli baseline p(1-p)(g-1)/g once p falls below p*(g), causing most groups to yield identical rewards and no relative signal. Experiments on Qwen2.5-Coder-3B (five SFT depths, three seeds) show pre-RL pass@1 rising while peak GRPO pass@10 falls from 0.806 to 0.481 (3-seed mean); pre-RL entropy correlates with GRPO outcome at ρ=+0.69. DeepSeek-Coder-6.7B avoids inversion because its pass@1 remains above p*(8)=0.083. A two-stage entropy-based diagnostic is proposed, and KL regularization plus label-smoothing variants fail to rescue the collapsed Qwen checkpoint.

Significance. If the proposed mechanism is confirmed, the result would directly affect checkpoint-selection practice in SFT-then-RL pipelines and motivate routine entropy monitoring. The variance formula is a clear strength: it is a direct, parameter-free derivation from Bernoulli statistics that supplies an explicit, falsifiable threshold p*(g) rather than a fitted quantity. The cross-model contrast (Qwen inversion versus DeepSeek compression) supplies a useful control, though model-specific factors are left open.

major comments (3)

[Abstract] Abstract: the central causal claim—that SFT-induced entropy collapse (rather than other depth-dependent changes in policy quality or output distribution) reduces effective group-relative advantage variance and produces the observed GRPO rank inversion—is supported only by the pre-RL entropy correlation (ρ=+0.69) and the theoretical variance formula; the manuscript reports no direct empirical comparison of within-group reward variance or effective sample diversity during the first GRPO steps across the SFT-depth ladder.
[Abstract] Abstract: the statement that KL regularization and label-smoothing variants “do not rescue the collapsed Qwen checkpoint” is used to argue the failure is not a trivial hyper-parameter artefact, yet the manuscript supplies neither the regularization coefficients nor the precise implementation details, leaving open whether the interventions were sufficient to restore rollout entropy.
[Abstract] Abstract: the independence assumption underlying the variance formula p(1-p)(g-1)/g and the threshold p*(g) is not tested against the actual empirical distribution of group rewards in the low-entropy regime; any systematic dependence among the g samples would alter the effective variance and the location of p*.

minor comments (2)

[Abstract] Abstract: the reported n=20 is not defined (e.g., evaluations per seed, total rollouts, or something else).
[Abstract] Abstract: the precise definition of “pre RL entropy” (token-level, sequence-level, or rollout-level) and the exact number of samples used to compute it should be stated explicitly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The points raised highlight opportunities to strengthen the empirical support for our proposed mechanism. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central causal claim—that SFT-induced entropy collapse (rather than other depth-dependent changes in policy quality or output distribution) reduces effective group-relative advantage variance and produces the observed GRPO rank inversion—is supported only by the pre-RL entropy correlation (ρ=+0.69) and the theoretical variance formula; the manuscript reports no direct empirical comparison of within-group reward variance or effective sample diversity during the first GRPO steps across the SFT-depth ladder.

Authors: We agree that direct measurement of within-group reward variance and sample diversity during early GRPO steps would provide stronger causal evidence beyond the pre-RL correlation and theoretical formula. In the revised manuscript we will add these empirical comparisons (computed from the rollout buffers of the first 50–100 GRPO steps) across the five SFT-depth checkpoints, plotted against both the theoretical baseline and the observed GRPO outcomes. revision: yes
Referee: [Abstract] Abstract: the statement that KL regularization and label-smoothing variants “do not rescue the collapsed Qwen checkpoint” is used to argue the failure is not a trivial hyper-parameter artefact, yet the manuscript supplies neither the regularization coefficients nor the precise implementation details, leaving open whether the interventions were sufficient to restore rollout entropy.

Authors: We acknowledge that the regularization experiments lack sufficient detail. In the revision we will report the exact KL coefficients (e.g., 0.01 and 0.05), label-smoothing strengths, the precise application schedule within the GRPO objective, and the resulting pre- and post-intervention entropy values on the collapsed checkpoint to demonstrate that entropy was not restored to levels that would avoid the variance collapse. revision: yes
Referee: [Abstract] Abstract: the independence assumption underlying the variance formula p(1-p)(g-1)/g and the threshold p*(g) is not tested against the actual empirical distribution of group rewards in the low-entropy regime; any systematic dependence among the g samples would alter the effective variance and the location of p*.

Authors: The Bernoulli independence assumption is standard for binary rewards but merits empirical scrutiny. We will add an analysis that computes the empirical within-group reward variance directly from the low-entropy rollouts and compares it to the theoretical p(1-p)(g-1)/g prediction; any systematic deviation will be quantified and its effect on the location of p*(g) discussed in the revised text. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical correlations and Bernoulli variance formula are independent of the central claims.

full rationale

The paper presents an observational study across SFT depth ladders on two models, reporting measured pass@1, entropy values, and downstream GRPO pass@10 outcomes along with an empirical Spearman correlation ρ=+0.69. The within-group variance expression p(1-p)(g-1)/g is the direct algebraic consequence of the Bernoulli variance for g i.i.d. binary samples and does not depend on any fitted parameter or result from the present experiments. No derivation step reduces a claimed prediction to a fitted input by construction, no self-citation chain is invoked as load-bearing justification, and the diagnostic is described as a post-hoc combination of the observed quantities rather than a closed-form tautology. The central claim therefore rests on external experimental measurements rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard statistical variance formula for binary group rewards and on the empirical correlation between pre-RL entropy and GRPO outcome; no free parameters or invented entities are introduced.

axioms (1)

standard math Expected within-group advantage variance for binary rewards equals p(1-p)(g-1)/g
This follows directly from the variance of a Bernoulli random variable scaled by the number of samples per group.

pith-pipeline@v0.9.1-grok · 5812 in / 1290 out tokens · 36229 ms · 2026-06-27T01:12:23.334189+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 11 linked inside Pith

[1]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint, arXiv:2107.03374, 2021

Pith/arXiv arXiv 2021
[2]

X., and Shi, G

Chen, Z., Qin, X., Wu, Y., Ling, Y., Ye, Q., Zhao, W. X., and Shi, G. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

arXiv 2025
[3]

V., Levine, S., and Ma, Y

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q. V., Levine, S., and Ma, Y. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, 2025. arXiv:2501.17161

Pith/arXiv arXiv 2025
[4]

Cui, G. et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

Pith/arXiv arXiv 2025
[5]

F., Lan, Q., Rahman, P., Mahmood, A

Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 2024

2024
[6]

Dragoi, M. et al. Beyond pass@k: Breadth-depth metrics for reasoning boundaries. arXiv preprint arXiv:2510.08325, 2025

arXiv 2025
[7]

K., et al

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y. K., et al. DeepSeek-Coder : When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

Pith/arXiv arXiv 2024
[8]

Qwen2.5-coder technical report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

Pith/arXiv arXiv 2024
[9]

Quagmires in SFT - RL post-training: When high SFT scores mislead and what to use instead

Kang, F., Kuchnik, M., Padthe, K., Vlastelica, M., Jia, R., Wu, C.-J., and Ardalani, N. Quagmires in SFT - RL post-training: When high SFT scores mislead and what to use instead. arXiv preprint arXiv:2510.01624, 2025

arXiv 2025
[10]

Maintaining plasticity in continual learning via regenerative regularization

Kumar, S., Marklund, H., and Van Roy, B. Maintaining plasticity in continual learning via regenerative regularization. arXiv preprint arXiv:2308.11958, 2023

arXiv 2023
[11]

S., Wang, Y., and Zhang, L

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models with EvalPlus . In Advances in Neural Information Processing Systems, 2023. arXiv:2305.01210

Pith/arXiv arXiv 2023
[12]

A., Pascanu, R., and Dabney, W

Lyle, C., Zheng, Z., Nikishin, E., Pires, B. A., Pascanu, R., and Dabney, W. Understanding plasticity in neural networks. In International Conference on Machine Learning, 2023. arXiv:2303.01486

arXiv 2023
[13]

The primacy bias in deep reinforcement learning

Nikishin, E., Schwarzer, M., D'Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, 2022. arXiv:2205.07802

arXiv 2022
[14]

DeepSeekMath : Pushing the limits of mathematical reasoning in open language models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[15]

and Karkhanis, D

Walder, C. and Karkhanis, D. Pass@ K policy optimization: Solving harder reinforcement learning problems. In Advances in Neural Information Processing Systems, 2025. arXiv:2505.15201

Pith/arXiv arXiv 2025
[16]

Xu, X. et al. KodCode : A diverse, challenging, and verifiable synthetic dataset for coding. In Annual Meeting of the Association for Computational Linguistics, 2025. arXiv:2503.02951

arXiv 2025
[17]

Yu, Q. et al. DAPO : An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025
[18]

Yue, Y. et al. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In Advances in Neural Information Processing Systems, 2025. arXiv:2504.13837

Pith/arXiv arXiv 2025
[19]

Good SFT optimizes for SFT , better SFT prepares for reinforcement learning

Zhang, D., Xu, Y., Wang, H., Chen, Q., and Peng, H. Good SFT optimizes for SFT , better SFT prepares for reinforcement learning. arXiv preprint arXiv:2602.01058, 2026

Pith/arXiv arXiv 2026

[1] [1]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint, arXiv:2107.03374, 2021

Pith/arXiv arXiv 2021

[2] [2]

X., and Shi, G

Chen, Z., Qin, X., Wu, Y., Ling, Y., Ye, Q., Zhao, W. X., and Shi, G. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

arXiv 2025

[3] [3]

V., Levine, S., and Ma, Y

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q. V., Levine, S., and Ma, Y. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In International Conference on Machine Learning, 2025. arXiv:2501.17161

Pith/arXiv arXiv 2025

[4] [4]

Cui, G. et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

Pith/arXiv arXiv 2025

[5] [5]

F., Lan, Q., Rahman, P., Mahmood, A

Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 2024

2024

[6] [6]

Dragoi, M. et al. Beyond pass@k: Breadth-depth metrics for reasoning boundaries. arXiv preprint arXiv:2510.08325, 2025

arXiv 2025

[7] [7]

K., et al

Guo, D., Zhu, Q., Yang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y. K., et al. DeepSeek-Coder : When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

Pith/arXiv arXiv 2024

[8] [8]

Qwen2.5-coder technical report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

Pith/arXiv arXiv 2024

[9] [9]

Quagmires in SFT - RL post-training: When high SFT scores mislead and what to use instead

Kang, F., Kuchnik, M., Padthe, K., Vlastelica, M., Jia, R., Wu, C.-J., and Ardalani, N. Quagmires in SFT - RL post-training: When high SFT scores mislead and what to use instead. arXiv preprint arXiv:2510.01624, 2025

arXiv 2025

[10] [10]

Maintaining plasticity in continual learning via regenerative regularization

Kumar, S., Marklund, H., and Van Roy, B. Maintaining plasticity in continual learning via regenerative regularization. arXiv preprint arXiv:2308.11958, 2023

arXiv 2023

[11] [11]

S., Wang, Y., and Zhang, L

Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models with EvalPlus . In Advances in Neural Information Processing Systems, 2023. arXiv:2305.01210

Pith/arXiv arXiv 2023

[12] [12]

A., Pascanu, R., and Dabney, W

Lyle, C., Zheng, Z., Nikishin, E., Pires, B. A., Pascanu, R., and Dabney, W. Understanding plasticity in neural networks. In International Conference on Machine Learning, 2023. arXiv:2303.01486

arXiv 2023

[13] [13]

The primacy bias in deep reinforcement learning

Nikishin, E., Schwarzer, M., D'Oro, P., Bacon, P.-L., and Courville, A. The primacy bias in deep reinforcement learning. In International Conference on Machine Learning, 2022. arXiv:2205.07802

arXiv 2022

[14] [14]

DeepSeekMath : Pushing the limits of mathematical reasoning in open language models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[15] [15]

and Karkhanis, D

Walder, C. and Karkhanis, D. Pass@ K policy optimization: Solving harder reinforcement learning problems. In Advances in Neural Information Processing Systems, 2025. arXiv:2505.15201

Pith/arXiv arXiv 2025

[16] [16]

Xu, X. et al. KodCode : A diverse, challenging, and verifiable synthetic dataset for coding. In Annual Meeting of the Association for Computational Linguistics, 2025. arXiv:2503.02951

arXiv 2025

[17] [17]

Yu, Q. et al. DAPO : An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025

[18] [18]

Yue, Y. et al. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In Advances in Neural Information Processing Systems, 2025. arXiv:2504.13837

Pith/arXiv arXiv 2025

[19] [19]

Good SFT optimizes for SFT , better SFT prepares for reinforcement learning

Zhang, D., Xu, Y., Wang, H., Chen, Q., and Peng, H. Good SFT optimizes for SFT , better SFT prepares for reinforcement learning. arXiv preprint arXiv:2602.01058, 2026

Pith/arXiv arXiv 2026