When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL
Pith reviewed 2026-05-21 20:24 UTC · model grok-4.3
The pith
In outcome-supervised RL, importance sampling ratios end up acting as token weights that unbalance positive-advantage and negative-advantage updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In OSRL, advantages are shared across tokens within a response, so importance sampling ratios shift from correcting the sampling distribution to allocating the shared advantage signal at the token level. For positive-advantage tokens, this shift produces a critical mismatch: updates to underrepresented tokens are suppressed while high-probability tokens are over-amplified, creating rich-get-richer dynamics that drive entropy collapse, excessive repetition, and premature convergence.
What carries the argument
The role shift of importance sampling ratios from distribution correction to token-level advantage allocation under shared advantages in OSRL
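A minimal sketch of that role shift, assuming the unclipped surrogate term r_t * A with one advantage A shared by every token in a response. The function name and the numbers are hypothetical and chosen only for illustration; clipping and normalization are deliberately left out.

```python
# Illustrative only: names and values are assumptions, not taken from the paper.

def token_weights(ratios, advantage):
    """Per-token multiplier on grad log pi when one advantage is shared by all tokens."""
    return [r * advantage for r in ratios]

# Three tokens from the same response, all receiving the same positive advantage.
ratios = [0.5, 1.0, 2.0]          # r_t = pi_new / pi_old for each token
weights = token_weights(ratios, advantage=1.0)
print(weights)
```

Because A is identical for all three tokens, the only thing that differentiates their updates is the ratio itself, which is the sense in which the ratio allocates the shared signal.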
If this is right
- Reversing the ratio weighting for positive-advantage tokens aligns their update direction with that of negative-advantage tokens.
- The correction reduces entropy collapse and excessive repetition during training.
- Training stability improves while gradient flow is preserved.
- Performance rises on math reasoning and coding benchmarks relative to standard GRPO baselines.
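The first bullet above can be illustrated with scalar weights. This is a sketch under stated assumptions: it uses the reciprocal ratio π_old/π for positive-advantage tokens, as in the passage quoted from eq. 4 further down the page, and it ignores clipping and any gradient handling the actual method applies. The function name is hypothetical.

```python
# Illustrative only: scalar weights, no clipping, no autograd.

def asymmetric_weight(ratio, advantage):
    """Reciprocal ratio for positive advantage, plain ratio otherwise."""
    if advantage > 0:
        return (1.0 / ratio) * advantage
    return ratio * advantage

ratios = [0.5, 1.0, 2.0]                               # r_t = pi_new / pi_old
pos = [asymmetric_weight(r, 1.0) for r in ratios]      # lagging token gets the largest weight
neg = [asymmetric_weight(r, -1.0) for r in ratios]     # token that rose most gets the largest push down
print(pos, neg)
```

In both cases the largest magnitude now falls on the token whose probability moved against the sign of the advantage, which is one reading of "aligned update direction".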
Where Pith is reading between the lines
- Similar ratio-induced weighting imbalances may appear in other RL methods that share a single outcome signal across long sequences.
- The asymmetric correction may combine with existing clipping thresholds to produce further gains in stability.
- Credit allocation rules in token-level RL more generally may need explicit asymmetry to avoid probability-dependent suppression.
Load-bearing premise
The observed entropy collapse, repetition, and premature convergence are caused primarily by the unbalanced token weighting induced by importance sampling ratios rather than by reward design, data distribution, or optimizer choices.
What would settle it
Training runs that apply the proposed asymmetric ratio reversal for positive-advantage tokens yet still show the same suppression of low-probability positive tokens or the same rate of entropy collapse would falsify the claimed mechanism.
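One way such a comparison could be made concrete is to track mean token entropy over training steps and compare the average drop per step between runs. The sketch below assumes next-token distributions are available as plain probability lists; the helper names and the two entropy curves are made up for illustration and are not results from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy over the token positions of a response."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def collapse_rate(entropy_by_step):
    """Average entropy drop per training step between first and last checkpoint."""
    steps = len(entropy_by_step) - 1
    return (entropy_by_step[0] - entropy_by_step[-1]) / steps

# Hypothetical entropy curves, for illustration only.
baseline = [1.2, 0.9, 0.6]
variant = [1.2, 1.1, 1.0]
print(collapse_rate(baseline), collapse_rate(variant))
```

If the corrected run showed a collapse rate indistinguishable from the baseline, that would count against the claimed mechanism.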
Original abstract
Reinforcement learning (RL) has shown great promise in the post-training of large language models (LLMs), which typically relies on token-level clipping to maintain stability during optimization. Despite the empirical success of GRPO-style methods, we identify a fundamental and previously overlooked challenge in this popular Outcome-Supervised RL (OSRL) paradigm. We reveal that in OSRL, where advantages are shared across tokens within a response, importance sampling (IS) ratios deviate from their traditional purpose of distribution correction in classic RL and instead become token-level weights that allocate the shared advantage signal across tokens. We show that this hidden role shift induces a critical mismatch for positive-advantage tokens, leading to unbalanced token weighting between positive and negative tokens. Specifically, it suppresses the update of underrepresented tokens that are lagging behind, while over-amplifying already high-probability tokens. This mismatch results in rich-get-richer dynamics that over-reinforce confident tokens and weaken catch-up learning, driving entropy collapse, excessive repetition, and premature convergence. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), a simple yet effective strategy that reverses the ratio-induced weighting of positive-advantage tokens while stabilizing extreme updates and maintaining gradient flow. This mismatch correction aligns their update direction with the learning dynamics of negative ones. Comprehensive experiments across math reasoning and coding benchmarks demonstrate that ASPO significantly mitigates entropy collapse, improves training stability, and enhances performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting ratio-induced weighting in LLM RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in outcome-supervised RL (OSRL) for LLMs, where a single advantage A is shared across all tokens in a response, the importance-sampling ratio r_t = π_new/π_old ceases to perform distribution correction and instead functions as a per-token weight that allocates the shared advantage signal. For positive-A tokens this produces a mismatch: high-probability tokens receive amplified updates while lagging tokens are suppressed, inducing rich-get-richer dynamics, entropy collapse, repetition, and premature convergence. The authors introduce Asymmetric Importance Sampling Policy Optimization (ASPO), which reverses the ratio weighting for positive-advantage tokens while preserving gradient flow, and report improved stability and benchmark performance over GRPO baselines on math and coding tasks.
Significance. If the identified weighting mismatch is shown to be the dominant driver, the work supplies a mechanistic account of pathologies routinely observed in GRPO-style training and a lightweight corrective mechanism that preserves the outcome-supervised paradigm. The proposal is parameter-free in its core adjustment and directly targets the diagnosed imbalance, which is a strength. Experimental gains on standard reasoning benchmarks indicate practical utility, though the strength of the causal attribution remains the central open question.
major comments (3)
- [§3] §3 (Analysis of IS role shift): the derivation that r_t * A becomes a token-level allocator for shared advantage is plausible from the per-token policy gradient, but the manuscript does not provide an explicit side-by-side comparison of the OSRL gradient versus the classic RL gradient under the same shared-A setting; without this, it is unclear whether the claimed mismatch is an inevitable consequence or an artifact of particular clipping or normalization choices.
- [Experiments] Experiments section (ablation studies): the central causal claim—that the ratio-induced weighting for positive-A tokens is the primary cause of entropy collapse and repetition—requires a controlled ablation that holds reward design, data distribution, optimizer, and clipping fixed while neutralizing only the ratio allocation (e.g., forcing r_t = 1 for all positive-A tokens). No such isolation experiment is reported; therefore alternative explanations (sparse outcome rewards, batch statistics, or existing GRPO clipping) cannot yet be ruled out.
- [§4] §4 (ASPO definition): the reversal of the ratio for positive-advantage tokens is presented as a direct correction, yet the manuscript does not derive or bound the resulting gradient norm or show that the modification preserves unbiasedness or monotonic improvement guarantees under the original OSRL objective.
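The isolation experiment requested in the second major comment can be sketched at the level of scalar token weights. This is an assumption-laden illustration, not the authors' code: it shows only the weighting rule (ratio forced to 1 for positive-advantage tokens, unchanged otherwise) and leaves out clipping and autograd details. Both function names are hypothetical.

```python
# Illustrative only: scalar weights for the baseline and the proposed ablation.

def baseline_weight(ratio, advantage):
    """Unclipped GRPO-style weight r_t * A."""
    return ratio * advantage

def ablation_weight(ratio, advantage):
    """Same weight, but with the ratio neutralised for positive-advantage tokens."""
    effective_ratio = 1.0 if advantage > 0 else ratio
    return effective_ratio * advantage

ratios = [0.5, 1.0, 2.0]
pos_ablation = [ablation_weight(r, 1.0) for r in ratios]     # ratio no longer differentiates tokens
neg_ablation = [ablation_weight(r, -1.0) for r in ratios]    # identical to the baseline
print(pos_ablation, neg_ablation)
```

Holding everything else fixed, any change in entropy or repetition between the two weightings could then be attributed to the ratio allocation on positive-advantage tokens.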
minor comments (2)
- [§2] Notation: the distinction between response-level advantage A and per-token advantage is introduced late; early equations would benefit from explicit indexing (e.g., A_i for response i) to avoid ambiguity when discussing token-level weighting.
- [Figures] Figure captions: several training-dynamic plots lack error bars or run-to-run variance, making it difficult to assess whether the reported entropy and repetition reductions are statistically reliable.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our analysis and experiments. We address each major point below and indicate the revisions we will make to the manuscript.
- Referee: [§3] §3 (Analysis of IS role shift): the derivation that r_t * A becomes a token-level allocator for shared advantage is plausible from the per-token policy gradient, but the manuscript does not provide an explicit side-by-side comparison of the OSRL gradient versus the classic RL gradient under the same shared-A setting; without this, it is unclear whether the claimed mismatch is an inevitable consequence or an artifact of particular clipping or normalization choices.
  Authors: We agree that an explicit side-by-side comparison would strengthen the section. In the revised manuscript we will add a dedicated paragraph deriving the per-token gradient under shared advantage (OSRL) next to the standard per-token RL gradient with per-token advantages. This will show that the ratio acting as a weight follows directly from the shared-A structure and is independent of clipping or normalization details.
  revision: yes
- Referee: [Experiments] Experiments section (ablation studies): the central causal claim, that the ratio-induced weighting for positive-A tokens is the primary cause of entropy collapse and repetition, requires a controlled ablation that holds reward design, data distribution, optimizer, and clipping fixed while neutralizing only the ratio allocation (e.g., forcing r_t = 1 for all positive-A tokens). No such isolation experiment is reported; therefore alternative explanations (sparse outcome rewards, batch statistics, or existing GRPO clipping) cannot yet be ruled out.
  Authors: We will add the requested isolation experiment in the revised version. Specifically, we will introduce a controlled variant that sets the importance ratio to 1 for all positive-advantage tokens while keeping every other hyper-parameter and implementation detail identical to the GRPO baseline. Results of this ablation will be reported alongside the existing entropy and repetition metrics to isolate the contribution of the ratio weighting.
  revision: yes
- Referee: [§4] §4 (ASPO definition): the reversal of the ratio for positive-advantage tokens is presented as a direct correction, yet the manuscript does not derive or bound the resulting gradient norm or show that the modification preserves unbiasedness or monotonic improvement guarantees under the original OSRL objective.
  Authors: ASPO is introduced as a practical correction to the diagnosed weighting mismatch rather than a theoretically equivalent estimator of the original objective. We do not claim that the modification preserves unbiasedness or monotonic improvement guarantees; such guarantees are already difficult to establish for GRPO-style methods under outcome supervision. In the revision we will add an empirical analysis of gradient norms under ASPO and a brief discussion clarifying the heuristic nature of the adjustment.
  revision: partial
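The empirical gradient-norm analysis promised in the last response could start from a quantity as simple as the global L2 norm logged per training step. The sketch below assumes gradients are available as flat lists of floats, one list per parameter tensor; the function name is hypothetical and no particular training framework is implied.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all parameter gradients, each given as a flat list of floats."""
    return math.sqrt(sum(g * g for vec in grads for g in vec))

# Two toy parameter tensors, for illustration only.
grads = [[3.0, 0.0], [0.0, 4.0]]
print(global_grad_norm(grads))
```

Logging this value for both the baseline and the modified weighting would show whether the reciprocal ratio inflates update magnitudes on low-ratio tokens.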
Circularity Check
No significant circularity; derivation self-contained
The paper's core chain starts from the standard per-token policy gradient term r_t * A * ∇logπ (with shared outcome advantage A) and analytically identifies the resulting token-weighting mismatch for positive-A tokens under IS ratios. This identification is a direct unpacking of existing RL math applied to the OSRL setting rather than a self-definition, fitted prediction, or self-citation reduction. The ASPO proposal follows as an explicit reversal of that identified weighting, preserving gradient flow without introducing new fitted parameters or renaming known results. No load-bearing uniqueness theorem, ansatz smuggling, or self-citation chain is required for the central claim. The analysis remains falsifiable via the described ablations and benchmarks.
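The per-token term named in this rationale follows from a standard identity, written out here as a worked step. The notation is an assumption made for illustration: π_θ stands for π_new, o_t is the token at position t, c_t is its context, and the expression is the unclipped surrogate with π_old held fixed.

```latex
r_t(\theta) = \frac{\pi_\theta(o_t \mid c_t)}{\pi_{\text{old}}(o_t \mid c_t)},
\qquad
\nabla_\theta \bigl( r_t(\theta)\, A \bigr)
  = A \, r_t(\theta) \, \nabla_\theta \log \pi_\theta(o_t \mid c_t).
```

Since A does not depend on t within a response, the factor r_t(θ) is the only term that varies across tokens in front of the score function ∇ log π.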
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard assumptions of policy gradient methods and importance sampling in RL
invented entities (1)
- ASPO weighting reversal: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel / Jcost definition · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "For tokens with Â_{i,t} > 0, we use the reciprocal of their IS weights ... r̂_{i,t} = π_old(·) / π(·)" [eq. 4]
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection / RCLCombiner_isCoupling_iff · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "This mismatch ... over-amplifying already high-probability tokens ... rich-get-richer dynamics"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 9 Pith papers
- Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
  Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
  StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
  On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
- Bounded Ratio Reinforcement Learning
  BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
- Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
  NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.
- When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
  Dynamic Gradient Gating monitors lm_head gradient norms to safely reuse rollout batches in RLVR, achieving up to 2.93x sample efficiency and 2.14x wall-clock speedup across math, ALFWorld, WebShop, and QA tasks.
- Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
  Balanced Aggregation fixes sign-length coupling and length downweighting in GRPO by computing separate token means for positive and negative subsets and combining them with sequence-count weights, yielding more stable...
- STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
  STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.
- OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
  OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.