Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Pith reviewed 2026-05-12 12:06 UTC · model grok-4.3
The pith
High-entropy forking tokens steer LLM reasoning in RLVR, so policy-gradient updates restricted to the top 20% of tokens by entropy match full training on Qwen3-8B and beat it on Qwen3-14B and Qwen3-32B.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In RLVR for LLM reasoning, only a small fraction of tokens exhibit high entropy, and these act as forking points that determine reasoning pathways. RLVR primarily adjusts the entropy of these high-entropy tokens. Restricting policy gradient updates to the top 20% of tokens by entropy matches full-gradient updates on Qwen3-8B and surpasses them on Qwen3-14B and Qwen3-32B on the AIME benchmarks, with gains of up to 11.04 points (Qwen3-32B, AIME'25).
What carries the argument
High-entropy forking tokens, the minority set of tokens with elevated uncertainty in CoT reasoning that steer the model toward different reasoning pathways; restricting policy gradient updates to them focuses the learning on these critical decision points.
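A minimal sketch of how such tokens could be picked out, assuming access to per-position logits: compute the Shannon entropy of each next-token distribution and keep the top 20% of positions. The function names, the toy logits, and the choice to rank within a single sequence are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one position's logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits, temperature=1.0):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits, temperature)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_mask(entropies, keep_fraction=0.2):
    """0/1 mask that keeps the keep_fraction highest-entropy positions."""
    n = len(entropies)
    k = max(1, math.ceil(keep_fraction * n))
    ranked = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(n)]

# Toy logits for five positions over a 3-word vocabulary (hypothetical values).
toy_logits = [
    [10.0, 0.0, 0.0],  # confident continuation
    [0.0, 0.0, 0.0],   # uniform distribution: a candidate "fork"
    [8.0, 0.0, 0.0],
    [9.0, 1.0, 0.0],
    [7.0, 0.0, 0.0],
]
entropies = [token_entropy(z) for z in toy_logits]
mask = top_entropy_mask(entropies, keep_fraction=0.2)
print(mask)
```

In this toy case only the uniform position survives the mask. Whether ranking is done per sequence or across a whole batch is a design choice the sketch leaves open.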
If this is right
- Performance on reasoning benchmarks can be preserved or enhanced by applying policy-gradient updates to only the 20% of tokens with the highest entropy.
- RLVR training becomes more token-efficient, since the roughly 80% of tokens with low entropy can be excluded from the policy gradient without hurting reasoning performance.
- Larger models show greater benefits from this selective update strategy, suggesting improved scalability.
- Training exclusively on low-entropy tokens results in degraded reasoning performance.
- The mechanism of RLVR is revealed as primarily entropy adjustment at reasoning forks rather than broad distribution changes.
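The selective update behind these points can be sketched as a masked surrogate loss. This is a simplified REINFORCE-style illustration under stated assumptions (a precomputed 0/1 mask, one advantage value per token, hypothetical names and numbers), not the objective used in the paper.

```python
def masked_policy_gradient_loss(logprobs, advantages, mask):
    """Surrogate loss averaged over kept positions only.

    Positions with mask == 0 are multiplied out, so in an autograd
    framework they would receive zero gradient from this term.
    """
    if not (len(logprobs) == len(advantages) == len(mask)):
        raise ValueError("inputs must have the same length")
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(m * a * lp for lp, a, m in zip(logprobs, advantages, mask))
    return -total / kept

# Hypothetical per-token values for a four-token response.
logprobs = [-0.1, -1.2, -0.05, -0.9]  # log-probability of each sampled token
advantages = [1.0, 1.0, 1.0, 1.0]     # same sequence-level advantage on every token
fork_mask = [0, 1, 0, 1]              # 1 marks a high-entropy position

restricted = masked_policy_gradient_loss(logprobs, advantages, fork_mask)
full = masked_policy_gradient_loss(logprobs, advantages, [1, 1, 1, 1])
print(restricted, full)
```

Normalizing by the number of kept tokens rather than the sequence length is one possible choice; it keeps the loss scale comparable between the restricted and full settings.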
Where Pith is reading between the lines
- Methods to dynamically identify high-entropy tokens during inference could further optimize training.
- This selective update approach might apply to other RL settings in language models beyond reasoning tasks.
- If high-entropy tokens are the key, it could simplify the design of reward models or verification in RLVR.
- Exploring whether these patterns hold in non-reasoning domains like coding or creative tasks would test the generality.
Load-bearing premise
That high-entropy tokens are the causal factors responsible for the reasoning improvements from RLVR, such that updates to them alone capture the essential learning without requiring adjustments to low-entropy tokens.
What would settle it
An experiment where restricting updates to high-entropy tokens results in no improvement or worse performance than full updates on the same benchmarks, while low-entropy restricted training performs equally well.
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RLVR for LLM reasoning is driven by a small minority (~20%) of high-entropy 'forking tokens' in CoT trajectories that steer reasoning paths. Analysis shows entropy patterns are stable and RLVR primarily adjusts high-entropy tokens. Restricting policy gradients to these tokens matches full updates on Qwen3-8B and exceeds them on larger models (+11.04 AIME'25 and +7.71 AIME'24 on 32B; +4.79 and +5.21 on 14B), while low-entropy-only updates collapse performance. This indicates RLVR efficacy arises from optimizing high-entropy tokens that decide reasoning directions.
Significance. If the empirical results hold, the work provides a useful token-entropy perspective on RLVR mechanisms and a practical route to more efficient training by updating only 20% of tokens while preserving or improving reasoning performance. The reported scaling trend (larger models benefit more from the restriction) is potentially important for future RLVR design. The contrast between high- and low-entropy restrictions offers a falsifiable observation that could guide further mechanistic studies.
major comments (3)
- [§5] §5 (restricted-update experiments): the performance deltas (e.g., +11.04 on AIME'25 for Qwen3-32B) are reported without error bars, number of random seeds, or statistical significance tests. This undermines the claim of 'significantly surpassing' full-gradient updates, as the gains could lie within run-to-run variance.
- [Method for token selection] Token-selection procedure (described in the method for high-entropy masking): the 20% threshold is used without sensitivity analysis across nearby values (15%, 25%) or justification for its optimality. Because the central 'beyond 80/20' claim rests on this specific cutoff, the result is not yet shown to be robust.
- [§3–4] §3–4 (entropy-pattern analysis and forking-token interpretation): the masking experiments establish sufficiency of high-entropy tokens but do not isolate causality. No control (e.g., gradient-norm-matched masking, targeted logit perturbation on high-entropy tokens only, or measurement of reasoning-path divergence attributable solely to those tokens) is presented to rule out the alternative that high-entropy tokens are merely proxies for higher-gradient-variance positions.
minor comments (3)
- [§2–3] The entropy formula (including any temperature scaling) should be stated explicitly in §2 or §3 rather than assumed known.
- [Figures] Figures depicting entropy evolution during training would benefit from consistent y-axis scaling and legends indicating model size and training step.
- [Related work] A brief comparison to prior work on token-level importance or entropy regularization in RLHF/RLVR would help situate the contribution.
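For reference, the quantity the first minor comment asks the authors to state is conventionally written as below. This is the standard Shannon entropy of a temperature-scaled softmax; the symbols ($z_{t,j}$ for the logit of vocabulary item $j$ at position $t$, $T$ for the decoding temperature, $V$ for the vocabulary size) are notation assumed here, and whether the paper uses exactly this form is not confirmed by the text above.

```latex
\[
p_{t,j} = \frac{\exp\!\left(z_{t,j}/T\right)}{\sum_{k=1}^{V}\exp\!\left(z_{t,k}/T\right)},
\qquad
H_t = -\sum_{j=1}^{V} p_{t,j}\,\log p_{t,j}
\]
```

Under this definition $H_t$ ranges from $0$ (all mass on one token) to $\log V$ (uniform distribution), and raising $T$ increases $H_t$ wherever the logits are not all equal, which is why the temperature used for the measurement matters.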
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help clarify the strengths and limitations of our work. We address each major comment point by point below, indicating where revisions will be made to improve rigor and robustness.
point-by-point responses
-
Referee: §5 (restricted-update experiments): the performance deltas (e.g., +11.04 on AIME'25 for Qwen3-32B) are reported without error bars, number of random seeds, or statistical significance tests. This undermines the claim of 'significantly surpassing' full-gradient updates, as the gains could lie within run-to-run variance.
Authors: We agree that the absence of error bars, seed counts, and statistical tests is a limitation that weakens the strength of the 'significantly surpassing' claim in the current manuscript. The reported deltas are large (particularly the +11.04 and +7.71 points on the 32B model), but without variance estimates they cannot be rigorously distinguished from run-to-run noise. In the revised version we will rerun the restricted-update experiments across at least three random seeds, report mean ± standard deviation, and include paired statistical significance tests (e.g., t-tests) against the full-gradient baseline. This will be added to §5 and the corresponding tables.
revision: yes
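The reporting promised in this response could look roughly like the following sketch, which computes the mean and sample standard deviation of seed-matched score differences and the paired t statistic. The function name is hypothetical and the scores are placeholders, not results from the paper.

```python
import math
import statistics

def paired_summary(restricted_scores, full_scores):
    """Mean difference, its sample std, paired t statistic, degrees of freedom."""
    if len(restricted_scores) != len(full_scores) or len(full_scores) < 2:
        raise ValueError("need at least two seed-matched pairs")
    diffs = [r - f for r, f in zip(restricted_scores, full_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample std (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("zero variance in differences; t statistic undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_stat, n - 1

# Placeholder scores for three seeds. These are NOT results from the paper.
toy_full = [40.0, 42.0, 41.0]
toy_restricted = [45.0, 46.0, 47.0]

mean_diff, sd_diff, t_stat, dof = paired_summary(toy_restricted, toy_full)
print(f"delta = {mean_diff:.2f} +/- {sd_diff:.2f}, t = {t_stat:.2f}, dof = {dof}")
```

With only three seeds the test has 2 degrees of freedom, where the two-sided 5% critical value is about 4.30, so only large and consistent gaps would reach significance.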
-
Referee: Token-selection procedure (described in the method for high-entropy masking): the 20% threshold is used without sensitivity analysis across nearby values (15%, 25%) or justification for its optimality. Because the central 'beyond 80/20' claim rests on this specific cutoff, the result is not yet shown to be robust.
Authors: The 20% cutoff was selected because it approximately matches the fraction of tokens whose entropy exceeds a natural inflection point in the per-trajectory entropy distribution (see Figure 2 and the entropy histogram in §3). Nevertheless, we acknowledge that a sensitivity study is required to demonstrate robustness. In the revision we will add an ablation varying the threshold from 10% to 30% in 5% increments, reporting both average performance and the fraction of total entropy captured. We will also provide a brief justification based on the cumulative entropy contribution of the top-k tokens, showing that performance plateaus or improves in the 15–25% range while remaining superior to the full-gradient baseline on the larger models.
revision: yes
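The "fraction of total entropy captured" mentioned in this response can be illustrated with a short sweep over the proposed thresholds. The helper name and the toy entropy values are assumptions made for illustration; real values would come from measured CoT traces.

```python
def entropy_share(entropies, percent):
    """Share of summed entropy held by the top `percent`% of positions."""
    n = len(entropies)
    k = max(1, (percent * n + 99) // 100)  # integer ceiling of percent% of n
    ranked = sorted(entropies, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total > 0 else 0.0

# Hypothetical per-token entropies: a few forks and many near-deterministic tokens.
toy_entropies = [2.0, 1.5, 1.0, 0.5] + [0.05] * 16

sweep = {pct: entropy_share(toy_entropies, pct) for pct in (10, 15, 20, 25, 30)}
for pct, share in sweep.items():
    print(f"top {pct}% of tokens hold {share:.1%} of total entropy")
```

A curve that flattens near 20% would support the chosen cutoff, while a curve that keeps rising steeply would suggest the threshold is arbitrary. The toy numbers here are built to flatten and carry no evidential weight.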
-
Referee: §3–4 (entropy-pattern analysis and forking-token interpretation): the masking experiments establish sufficiency of high-entropy tokens but do not isolate causality. No control (e.g., gradient-norm-matched masking, targeted logit perturbation on high-entropy tokens only, or measurement of reasoning-path divergence attributable solely to those tokens) is presented to rule out the alternative that high-entropy tokens are merely proxies for higher-gradient-variance positions.
Authors: The high- versus low-entropy masking contrast does establish that restricting updates to the high-entropy subset is sufficient to match or exceed full-gradient performance while the complementary low-entropy subset collapses, which is difficult to reconcile with a pure gradient-variance proxy story. However, we concede that the experiments do not fully isolate causality from correlated factors such as gradient magnitude or variance. In the revision we will (i) add an explicit discussion of this alternative explanation in §4, (ii) include a gradient-norm-matched random masking control where possible, and (iii) report reasoning-path divergence metrics (e.g., token-level edit distance to the final answer) conditioned on high-entropy updates. These additions will strengthen the causal interpretation without overclaiming the current evidence.
revision: partial
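A simpler relative of the control in item (ii) is a random mask that keeps the same number of tokens as the entropy-based mask. The sketch below matches token count only; matching gradient norm would additionally need per-token gradient norms from the training framework, which are not modeled here. The function name and the example mask are hypothetical.

```python
import random

def count_matched_random_mask(reference_mask, seed=0):
    """Random 0/1 mask with the same length and number of ones as reference_mask."""
    n = len(reference_mask)
    k = sum(reference_mask)
    rng = random.Random(seed)  # local generator keeps the control reproducible
    chosen = set(rng.sample(range(n), k))
    return [1 if i in chosen else 0 for i in range(n)]

# Hypothetical entropy-based mask over ten positions (two kept tokens, i.e. 20%).
fork_mask = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
control_mask = count_matched_random_mask(fork_mask, seed=0)
print(control_mask)
```

If training with such a control mask performed as well as the entropy-based mask, the benefit would more plausibly come from update sparsity than from entropy-based selection.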
Circularity Check
No circularity: empirical ablation results are measured outcomes, not tautological
full rationale
The paper's chain consists of (1) observational measurement of token entropy distributions in CoT traces, (2) tracking how those distributions evolve under RLVR, and (3) an ablation experiment that masks gradients to the top ~20% high-entropy tokens and reports downstream benchmark scores. The performance numbers (+11.04 AIME'25 on 32B, etc.) are externally measured quantities obtained after training; they are not algebraically or statistically forced by the entropy-threshold definition used to select the mask. No equations reduce the final result to the input selection rule, no self-citations carry the central claim, and no fitted parameter is relabeled as a prediction. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- high-entropy token selection threshold (20%)
axioms (1)
- domain assumption: High-entropy tokens correspond to critical reasoning forks that determine downstream performance
invented entities (1)
- forking tokens: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyForcing: additive_composition_is_minimal echoes "utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates... training exclusively on the 80% lowest-entropy tokens leads to a marked decline"
Forward citations
Cited by 32 Pith papers
-
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
-
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
Hölder Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
-
Epistemic Uncertainty for Test-Time Discovery
UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
-
AIPO: Learning to Reason from Active Interaction
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
-
Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation
DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
-
When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of sus...
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
-
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
-
LLMs Should Express Uncertainty Explicitly
Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Selective Off-Policy Reference Tuning with Plan Guidance
SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
-
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
-
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
-
EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.
-
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
-
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
-
Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization that outperforms baselines.
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.