When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL

Fuzheng Zhang; Guorui Zhou; Jiakang Wang; Kun Gai; Lei Lin; Ling Pan; Qingpeng Cai; Runze Liu; Wenping Hu; Xiu Li

arxiv: 2510.06062 · v2 · pith:WJBPH6XGnew · submitted 2025-10-07 · 💻 cs.CL

Pith reviewed 2026-05-21 20:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords outcome-supervised reinforcement learning · importance sampling · token-level weighting · entropy collapse · LLM post-training · credit allocation · asymmetric policy optimization

The pith

Importance sampling ratios in outcome-supervised RL shift into token weights that unbalance positive and negative advantage updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that advantages shared across all tokens in a response during outcome-supervised RL cause importance sampling ratios to stop acting mainly as distribution correctors. Instead the ratios become token-level multipliers that allocate the single shared advantage signal. This produces a mismatch for positive-advantage tokens: already high-probability tokens receive amplified updates while lagging tokens receive suppressed updates. The resulting rich-get-richer pattern drives entropy collapse, repetition, and early stopping in LLM training. A sympathetic reader would care because the same pattern explains training failures that clipping alone has not solved.

Core claim

In OSRL advantages are shared across tokens within a response, so importance sampling ratios shift from distribution correction to allocating the shared advantage signal at token level. This shift produces a critical mismatch for positive-advantage tokens that suppresses updates to underrepresented tokens while over-amplifying high-probability tokens, creating rich-get-richer dynamics that drive entropy collapse, excessive repetition, and premature convergence.
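As a rough illustration of the weighting described above, the sketch below assumes an unclipped GRPO-style surrogate in which each token contributes r_t · A to the objective, so the coefficient on ∇log π for that token is r_t · A. The function name `is_token_weights` and the probabilities are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def is_token_weights(new_probs, old_probs, advantage):
    """Per-token coefficient on grad(log pi) under an unclipped surrogate.

    Assumption: the objective is sum_t r_t * A with one shared advantage A,
    so d(r_t * A)/d(theta) = r_t * A * grad(log pi_t).
    """
    return [(p_new / p_old) * advantage
            for p_new, p_old in zip(new_probs, old_probs)]


# Hypothetical two-token response with a shared positive advantage A = 1.0.
# Token 0 has gained probability since sampling; token 1 has lost some.
old_probs = [0.50, 0.50]
new_probs = [0.75, 0.25]
weights = is_token_weights(new_probs, old_probs, advantage=1.0)
print(weights)  # [1.5, 0.5]
```

Both tokens share the same advantage, yet the token that has already gained probability receives three times the weight of the one that lagged.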

What carries the argument

The role shift of importance sampling ratios from distribution correction to token-level advantage allocation under shared advantages in OSRL

If this is right

Reversing the ratio weighting for positive-advantage tokens aligns their update direction with that of negative-advantage tokens.
The correction reduces entropy collapse and excessive repetition during training.
Training stability improves while gradient flow is preserved.
Performance rises on math reasoning and coding benchmarks relative to standard GRPO baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar ratio-induced weighting imbalances may appear in other RL methods that share a single outcome signal across long sequences.
The asymmetric correction may combine with existing clipping thresholds to produce further gains in stability.
Credit allocation rules in token-level RL more generally may need explicit asymmetry to avoid probability-dependent suppression.

Load-bearing premise

The observed entropy collapse, repetition, and premature convergence are caused primarily by the unbalanced token weighting induced by importance sampling ratios rather than by reward design, data distribution, or optimizer choices.

What would settle it

Training runs that apply the proposed asymmetric ratio reversal for positive-advantage tokens yet still show the same suppression of low-probability positive tokens or the same rate of entropy collapse would falsify the claimed mechanism.
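Checking this prediction requires tracking policy entropy over training. A minimal sketch of such a diagnostic follows, assuming access to the full next-token distribution at each position of a response. The function names are hypothetical and this is not the paper's evaluation code.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average entropy over the token positions of one response."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# Hypothetical distributions over a 4-token vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain
collapsed = [1.0, 0.0, 0.0, 0.0]     # fully deterministic
print(token_entropy(uniform))        # equals log(4)
print(mean_entropy([uniform, collapsed]))
```

A curve of `mean_entropy` that falls toward zero at the same rate with and without the asymmetric reversal would count against the claimed mechanism.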

Original abstract

Reinforcement learning (RL) has shown great promise in large language models (LLMs) post-training, which typically rely on token-level clipping to maintain stability during optimization. Despite the empirical success of GRPO-style methods, we identify a fundamental and previously overlooked challenge in this popular Outcome-Supervised RL (OSRL) paradigm. We reveal that in OSRL, where advantages are shared across tokens within a response, importance sampling (IS) ratios deviate from their traditional purpose of distribution correction as in classic RL, which become token-level weights that allocate the shared advantage signal across tokens. We show that this hidden role shift induces a critical mismatch for positive-advantage tokens, leading to unbalanced token weighting between positive and negative tokens. Specifically, it suppresses the update of underrepresented tokens that are lagging behind, while over-amplifying already high-probability tokens. This mismatch results in rich-get-richer dynamics that over-reinforce confident tokens, weaken catch-up learning that drive entropy collapse, excessive repetition, and premature convergence. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), a simple yet effective strategy that reverses the ratio-induced weighting of positive-advantage tokens, while stabilizing extreme updates and maintaining gradient flow. This mismatch correction aligns their update direction with the learning dynamics of negative ones. Comprehensive experiments across math reasoning and coding benchmarks demonstrate that ASPO significantly mitigates entropy collapse, improves training stability, and enhances performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting ratio-induced weighting in LLM RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper flags a real shift in how IS ratios allocate shared advantages in OSRL, proposes a simple reversal fix, but does not isolate that mechanism from other training factors.

The core observation is that when advantages are shared across tokens in a response, the importance sampling ratio stops acting mainly as a distribution corrector and starts functioning as a per-token multiplier on the advantage signal. This creates an imbalance where positive-advantage tokens that already have high probability get amplified while lagging ones are suppressed, which the authors tie to entropy collapse and repetition. That diagnosis is not spelled out in the GRPO papers they cite, so the framing is new.

They then introduce ASPO, which reverses the ratio weighting only for positive-advantage tokens and adds some stabilization, and they report gains on math and coding benchmarks over standard GRPO baselines. Those results are the practical part worth noting. The experiments show improved stability and final performance, which is useful to see even if the numbers are not dramatically larger than prior work.

The main limitation is that the causal link between the ratio mismatch and the observed pathologies is not tightly controlled. There is no ablation that keeps reward design, data distribution, optimizer, and clipping fixed while only neutralizing the ratio-based allocation for positive tokens. Without that isolation, it remains possible that sparse outcome rewards or existing clipping rules are contributing more to the rich-get-richer behavior than the authors emphasize. The derivation itself looks internally consistent from the abstract and description, but the paper would be stronger with explicit gradient terms and a controlled comparison.

This work is aimed at practitioners running outcome-supervised RL on LLMs who are already seeing repetition or early convergence. A reader in that setting could try the ASPO change directly and see if it helps their runs. It is not a foundational theoretical result, but the problem it targets is common enough that the paper merits a serious referee rather than a desk reject. I would send it out for review with a request for tighter ablations on the causal claim.

Referee Report

3 major / 2 minor

Summary. The paper claims that in outcome-supervised RL (OSRL) for LLMs, where a single advantage A is shared across all tokens in a response, the importance-sampling ratio r_t = π_new/π_old ceases to perform distribution correction and instead functions as a per-token weight that allocates the shared advantage signal. For positive-A tokens this produces a mismatch: high-probability tokens receive amplified updates while lagging tokens are suppressed, inducing rich-get-richer dynamics, entropy collapse, repetition, and premature convergence. The authors introduce Asymmetric Importance Sampling Policy Optimization (ASPO), which reverses the ratio weighting for positive-advantage tokens while preserving gradient flow, and report improved stability and benchmark performance over GRPO baselines on math and coding tasks.

Significance. If the identified weighting mismatch is shown to be the dominant driver, the work supplies a mechanistic account of pathologies routinely observed in GRPO-style training and a lightweight corrective mechanism that preserves the outcome-supervised paradigm. The proposal is parameter-free in its core adjustment and directly targets the diagnosed imbalance, which is a strength. Experimental gains on standard reasoning benchmarks indicate practical utility, though the strength of the causal attribution remains the central open question.

major comments (3)

[§3] §3 (Analysis of IS role shift): the derivation that r_t * A becomes a token-level allocator for shared advantage is plausible from the per-token policy gradient, but the manuscript does not provide an explicit side-by-side comparison of the OSRL gradient versus the classic RL gradient under the same shared-A setting; without this, it is unclear whether the claimed mismatch is an inevitable consequence or an artifact of particular clipping or normalization choices.
[Experiments] Experiments section (ablation studies): the central causal claim—that the ratio-induced weighting for positive-A tokens is the primary cause of entropy collapse and repetition—requires a controlled ablation that holds reward design, data distribution, optimizer, and clipping fixed while neutralizing only the ratio allocation (e.g., forcing r_t = 1 for all positive-A tokens). No such isolation experiment is reported; therefore alternative explanations (sparse outcome rewards, batch statistics, or existing GRPO clipping) cannot yet be ruled out.
[§4] §4 (ASPO definition): the reversal of the ratio for positive-advantage tokens is presented as a direct correction, yet the manuscript does not derive or bound the resulting gradient norm or show that the modification preserves unbiasedness or monotonic improvement guarantees under the original OSRL objective.
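The isolation experiment requested in the second comment can be pictured as a switch between three weighting rules. The sketch below is an assumption-laden illustration and not the authors' code: `token_weight` and the mode names are hypothetical, clipping is ignored, and "reciprocal" stands in for the reversal only at the level of weight values.

```python
MODES = ("standard", "neutral", "reciprocal")


def token_weight(ratio, advantage, mode="standard"):
    """Coefficient on grad(log pi) for one token under a hypothetical variant.

    standard:   r_t * A for every token (unclipped GRPO-style weighting)
    neutral:    r_t forced to 1 when A > 0 (the referee's isolation variant)
    reciprocal: 1 / r_t when A > 0 (reversed weighting)
    Negative-advantage tokens keep r_t * A in all three modes.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if advantage > 0 and mode == "neutral":
        ratio = 1.0
    elif advantage > 0 and mode == "reciprocal":
        ratio = 1.0 / ratio
    return ratio * advantage


# Two positive-advantage tokens: one ahead (r = 2.0), one lagging (r = 0.5).
for mode in MODES:
    print(mode, [token_weight(r, 1.0, mode) for r in (2.0, 0.5)])
# standard [2.0, 0.5]
# neutral [1.0, 1.0]
# reciprocal [0.5, 2.0]
```

Comparing entropy and repetition across the three modes, with everything else held fixed, is what would separate the ratio effect from reward sparsity or clipping.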

minor comments (2)

[§2] Notation: the distinction between response-level advantage A and per-token advantage is introduced late; early equations would benefit from explicit indexing (e.g., A_i for response i) to avoid ambiguity when discussing token-level weighting.
[Figures] Figure captions: several training-dynamic plots lack error bars or run-to-run variance, making it difficult to assess whether the reported entropy and repetition reductions are statistically reliable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our analysis and experiments. We address each major point below and indicate the revisions we will make to the manuscript.

Referee: [§3] §3 (Analysis of IS role shift): the derivation that r_t * A becomes a token-level allocator for shared advantage is plausible from the per-token policy gradient, but the manuscript does not provide an explicit side-by-side comparison of the OSRL gradient versus the classic RL gradient under the same shared-A setting; without this, it is unclear whether the claimed mismatch is an inevitable consequence or an artifact of particular clipping or normalization choices.

Authors: We agree that an explicit side-by-side comparison would strengthen the section. In the revised manuscript we will add a dedicated paragraph deriving the per-token gradient under shared advantage (OSRL) next to the standard per-token RL gradient with per-token advantages. This will show that the ratio acting as a weight follows directly from the shared-A structure and is independent of clipping or normalization details.
Revision: yes

Referee: [Experiments] Experiments section (ablation studies): the central causal claim, that the ratio-induced weighting for positive-A tokens is the primary cause of entropy collapse and repetition, requires a controlled ablation that holds reward design, data distribution, optimizer, and clipping fixed while neutralizing only the ratio allocation (e.g., forcing r_t = 1 for all positive-A tokens). No such isolation experiment is reported; therefore alternative explanations (sparse outcome rewards, batch statistics, or existing GRPO clipping) cannot yet be ruled out.

Authors: We will add the requested isolation experiment in the revised version. Specifically, we will introduce a controlled variant that sets the importance ratio to 1 for all positive-advantage tokens while keeping every other hyper-parameter and implementation detail identical to the GRPO baseline. Results of this ablation will be reported alongside the existing entropy and repetition metrics to isolate the contribution of the ratio weighting.
Revision: yes

Referee: [§4] §4 (ASPO definition): the reversal of the ratio for positive-advantage tokens is presented as a direct correction, yet the manuscript does not derive or bound the resulting gradient norm or show that the modification preserves unbiasedness or monotonic improvement guarantees under the original OSRL objective.

Authors: ASPO is introduced as a practical correction to the diagnosed weighting mismatch rather than a theoretically equivalent estimator of the original objective. We do not claim that the modification preserves unbiasedness or monotonic improvement guarantees; such guarantees are already difficult to establish for GRPO-style methods under outcome supervision. In the revision we will add an empirical analysis of gradient norms under ASPO and a brief discussion clarifying the heuristic nature of the adjustment.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper's core chain starts from the standard per-token policy gradient term r_t * A * ∇logπ (with shared outcome advantage A) and analytically identifies the resulting token-weighting mismatch for positive-A tokens under IS ratios. This identification is a direct unpacking of existing RL math applied to the OSRL setting rather than a self-definition, fitted prediction, or self-citation reduction. The ASPO proposal follows as an explicit reversal of that identified weighting, preserving gradient flow without introducing new fitted parameters or renaming known results. No load-bearing uniqueness theorem, ansatz smuggling, or self-citation chain is required for the central claim. The analysis remains falsifiable via the described ablations and benchmarks.
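The first step of that chain can be written out explicitly. The block below is a sketch under stated assumptions: an unclipped per-token surrogate r_t A, an advantage A that does not depend on θ, and the notation x (prompt) and y_t (token at position t), which is introduced here for illustration. π_θ plays the role of π_new in the referee summary.

```latex
\nabla_\theta \bigl( r_t \, A \bigr)
  = A \, \frac{\nabla_\theta \pi_\theta(y_t \mid x, y_{<t})}
              {\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}
  = r_t \, A \, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}),
\qquad
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}
           {\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}.
```

Because A is the same for every token in the response, r_t is the only factor in the coefficient that differs from token to token, which is the sense in which it acts as a token-level weight.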

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard RL policy-gradient assumptions plus the empirical observation that shared advantages turn IS ratios into token weights; no new free parameters or invented physical entities are introduced.

axioms (1)

[standard math] Standard assumptions of policy gradient methods and importance sampling in RL
The analysis builds directly on the usual importance-sampling correction and advantage estimation used in GRPO-style methods.

invented entities (1)

ASPO weighting reversal [no independent evidence]
purpose: To invert the ratio-induced weighting specifically for positive-advantage tokens
A new algorithmic modification introduced to align update directions; no external falsifiable prediction is provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5850 in / 1447 out tokens · 44473 ms · 2026-05-21T20:24:55.323852+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel / Jcost definition · echoes

"For tokens with Â_{i,t} > 0, we use the reciprocal of their IS weights ... r̂_{i,t} = π_old(·) / π(·)" [eq. 4]

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection / RCLCombiner_isCoupling_iff · echoes

"This mismatch ... over-amplifying already high-probability tokens ... rich-get-richer dynamics"

echoes: The paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
cs.SE 2026-05 unverdicted novelty 7.0

StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

Bounded Ratio Reinforcement Learning
cs.LG 2026-04 conditional novelty 7.0

BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
cs.LG 2026-05 unverdicted novelty 6.0

Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
cs.LG 2026-04 unverdicted novelty 6.0

Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
cs.CL 2026-02 unverdicted novelty 6.0

STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 9 Pith papers · 13 internal anchors

[1]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

URLhttps://hkunlp.github. io/blog/2025/Polaris. Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

URLhttps://proceedings.neurips.cc/ paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617,

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Skywork Open Reasoner 1 Technical Report

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URLhttps://aclanthology.org/2024.acl-long.211/. Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.211 2024
[5]

ISBN 9798400702297

Association for Computing Machinery. ISBN 9798400702297. doi: 10.1145/3600006.3613165. URLhttps://doi.org/10.1145/3600006. 3613165. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ra- masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving ...

work page doi:10.1145/3600006.3613165
[6]

URLhttps://proceedings.neurips.cc/paper_files/paper/ 2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen ...

work page 2022
[7]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

doi: 10.1126/science.abq1158. URL https://www.science.org/doi/abs/10.1126/science.abq1158. Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations,

work page doi:10.1126/science.abq1158
[8]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

URLhttps://openreview.net/ forum?id=v8L0pN6EOi. Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025a. Runze Liu, Fengshuo Bai, Yali Du, and Yaodong Yang. Meta-reward-net: Implicitly dif...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 8be9c134bb193d8bd3827d4df8488228-Paper-Conference.pdf. Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling.arXiv preprint arXiv:2502.06703, 2025b. Runze Liu, Jiakang Wang, Yul...

work page arXiv 2022
[10]

Proximal Policy Optimization Algorithms

Morgan Kaufmann Publishers Inc. ISBN 1558607072. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

ISBN 9798400711961

Association for Computing Machinery. ISBN 9798400711961. doi: 10.1145/ 3689031.3696075. URLhttps://doi.org/10.1145/3689031.3696075. Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models.arXiv preprint ar...

work page doi:10.1145/3689031.3696075
[13]

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, and Guorui Zhou. Stabilizing knowledge, promoting reasoning: Dual-token constraints for rlvr.arXiv preprint arXiv:2507.15778, 2025a. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority toke...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

15 ASPO: Asymmetric Importance Sampling Policy Optimization doi: 10.1609/aaai.v34i04.6144. URLhttps://ojs.aaai.org/index.php/AAAI/article/ view/6144. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv prepri...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v34i04.6144
[15]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild.arXiv preprint arXiv:2503.18892,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.findings-acl.547 2025
[18]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

•DAPO(Yu et al., 2025): A strong OSRL algorithm built upon GRPO (Shao et al., 2024)

as the base model and compare ASPO with the following baselines: •Base Model: The original model without any RL fine-tuning. •DAPO(Yu et al., 2025): A strong OSRL algorithm built upon GRPO (Shao et al., 2024). • DeepScaleR-1.5B(Luo et al., 2025b): A 1.5B model trained for mathematical reasoning with iterative context-length expansion. • DeepCoder-1.5B(Luo...

work page 2025
[20]

For coding, we employ DeepCoder (Luo et al., 2025a), CodeContests (Li et al., 2022), and CodeForces (Penedo et al.,

for mathematical tasks. For coding, we employ DeepCoder (Luo et al., 2025a), CodeContests (Li et al., 2022), and CodeForces (Penedo et al.,

work page 2022

[1] [1]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

URLhttps://hkunlp.github. io/blog/2025/Polaris. Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

URLhttps://proceedings.neurips.cc/ paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models.arXiv preprint arXiv:2505.22617,

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Skywork Open Reasoner 1 Technical Report

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.211. URLhttps://aclanthology.org/2024.acl-long.211/. Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork open reasoner 1 technical report.arXiv preprint arXiv:2505.22312,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.211 2024

[5] [5]

ISBN 9798400702297

Association for Computing Machinery. ISBN 9798400702297. doi: 10.1145/3600006.3613165. URLhttps://doi.org/10.1145/3600006. 3613165. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ra- masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving ...

[6]

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/18abbeef8cfe9203fdf9053c9c4fe191-Paper-Conference.pdf.

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen ...

[7]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

doi: 10.1126/science.abq1158. URL https://www.science.org/doi/abs/10.1126/science.abq1158.

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations,

[8]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

URL https://openreview.net/forum?id=v8L0pN6EOi.

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025a.

Runze Liu, Fengshuo Bai, Yali Du, and Yaodong Yang. Meta-reward-net: Implicitly dif...

[9]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/8be9c134bb193d8bd3827d4df8488228-Paper-Conference.pdf.

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling. arXiv preprint arXiv:2502.06703, 2025b.

Runze Liu, Jiakang Wang, Yul...

[10]

Proximal Policy Optimization Algorithms

Morgan Kaufmann Publishers Inc. ISBN 1558607072.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[12]

ISBN 9798400711961

Association for Computing Machinery. ISBN 9798400711961. doi: 10.1145/3689031.3696075. URL https://doi.org/10.1145/3689031.3696075.

Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models. arXiv preprint ar...

[13]

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, and Guorui Zhou. Stabilizing knowledge, promoting reasoning: Dual-token constraints for rlvr. arXiv preprint arXiv:2507.15778, 2025a.

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority toke...

[14]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

doi: 10.1609/aaai.v34i04.6144. URL https://ojs.aaai.org/index.php/AAAI/article/view/6144.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv prepri...

[15]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118,

[16]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,

[17]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yua...

[18]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,

[19]

as the base model and compare ASPO with the following baselines:

• Base Model: The original model without any RL fine-tuning.
• DAPO (Yu et al., 2025): A strong OSRL algorithm built upon GRPO (Shao et al., 2024).
• DeepScaleR-1.5B (Luo et al., 2025b): A 1.5B model trained for mathematical reasoning with iterative context-length expansion.
• DeepCoder-1.5B (Luo...

[20]

for mathematical tasks. For coding, we employ DeepCoder (Luo et al., 2025a), CodeContests (Li et al., 2022), and CodeForces (Penedo et al.,
