pith. machine review for the scientific record.

arxiv: 2506.01939 · v2 · submitted 2025-06-02 · 💻 cs.CL · cs.AI · cs.LG

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Andrew Zhao, An Yang, Bowen Yu, Chang Gao, Chujie Zheng, Gao Huang, Jianxin Yang, Junyang Lin, Kai Dang, Le Yu, Rui Lu, Shenzhi Wang, Shiji Song, Shixuan Liu, Xionghui Chen, Yang Yue, Yuqiong Liu, Zhenru Zhang

Pith reviewed 2026-05-12 12:06 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords reinforcement learning · LLM reasoning · token entropy · RLVR · chain of thought · policy gradient · high entropy tokens

The pith

High-entropy forking tokens steer LLM reasoning in RLVR: updating only the 20% highest-entropy tokens matches full-gradient training on Qwen3-8B and beats it on Qwen3-14B and Qwen3-32B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines token entropy during reinforcement learning with verifiable rewards for improving LLM reasoning abilities. It identifies that high-entropy tokens serve as critical branching points in chain-of-thought sequences. Analysis shows RLVR training mainly modifies these high-entropy tokens rather than low-entropy ones. By limiting policy updates to these minority tokens, the approach achieves performance on par with full updates for smaller models and superior results for larger ones on math reasoning tasks. This indicates that the benefits of RLVR stem from optimizing decision points in reasoning paths.

Core claim

In RLVR for LLM reasoning, only a small fraction of tokens have high entropy, and these act as forking points that determine reasoning pathways. RLVR primarily adjusts the entropy of these high-entropy tokens. Restricting policy gradient updates to the 20% highest-entropy tokens maintains performance comparable to full-gradient updates on Qwen3-8B and surpasses them on Qwen3-14B and Qwen3-32B on AIME benchmarks, with gains of up to +11.04 points (Qwen3-32B, AIME'25).

What carries the argument

High-entropy forking tokens, the minority set of tokens with elevated uncertainty in CoT reasoning that steer the model toward different reasoning pathways; restricting policy gradient updates to them focuses the learning on these critical decision points.
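
To make "token entropy" concrete, here is a minimal sketch of how the entropy at one generation step could be computed from the model's logits. It assumes the usual Shannon entropy of the softmax distribution, H = -sum_j p_j log p_j with p = softmax(logits / T). The function name token_entropy and the temperature argument are illustrative and not taken from the paper.

```python
import math

def token_entropy(logits, temperature=1.0):
    """Shannon entropy (in nats) of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Terms with p == 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy 4-token vocabulary (hypothetical logits, for illustration only).
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])  # maximally uncertain step
peaked = token_entropy([20.0, 0.0, 0.0, 0.0])  # nearly deterministic step
```

Under this definition a uniform distribution over four options gives log(4) nats, while a sharply peaked distribution gives an entropy close to zero. In the paper's terms, the first kind of position is a candidate forking token and the second is not.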

If this is right

  • Performance on reasoning benchmarks can be preserved or enhanced by updating gradients for only the top 20% highest-entropy tokens.
  • RLVR training becomes more efficient as it ignores the majority of low-entropy tokens that do not influence reasoning directions.
  • Larger models show greater benefits from this selective update strategy, suggesting improved scalability.
  • Training exclusively on low-entropy tokens results in degraded reasoning performance.
  • The mechanism of RLVR is revealed as primarily entropy adjustment at reasoning forks rather than broad distribution changes.
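
The selective update described in the points above can be sketched as a mask over token positions. This is a simplified illustration, not the authors' implementation: it assumes the top 20% is chosen by ranking per-token entropies within one batch, and that the training loss is averaged only over the selected positions. The helper names and all numeric values are hypothetical.

```python
def top_entropy_mask(entropies, fraction=0.2):
    """Return a 0/1 mask selecting the `fraction` highest-entropy positions."""
    n = len(entropies)
    k = max(1, int(fraction * n))  # number of tokens kept (rounded down)
    # Indices sorted by entropy, highest first; ties keep their original order.
    ranked = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(n)]

def masked_mean(per_token_loss, mask):
    """Average the loss over selected tokens only; other tokens contribute nothing."""
    selected = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(selected) / len(selected)

# Hypothetical entropies and per-token losses for a 10-token trace.
entropies = [0.01, 0.02, 1.90, 0.03, 0.00, 0.05, 2.30, 0.01, 0.04, 0.02]
losses = [0.5, 0.4, 1.2, 0.3, 0.1, 0.6, 0.8, 0.2, 0.7, 0.9]
mask = top_entropy_mask(entropies, fraction=0.2)
loss = masked_mean(losses, mask)
```

In this toy trace only positions 2 and 6 survive the mask, so the loss depends on those two tokens alone. Inverting the mask would give the low-entropy-only condition that the paper reports as degrading performance.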

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods to dynamically identify high-entropy tokens during inference could further optimize training.
  • This selective update approach might apply to other RL settings in language models beyond reasoning tasks.
  • If high-entropy tokens are the key, it could simplify the design of reward models or verification in RLVR.
  • Exploring whether these patterns hold in non-reasoning domains like coding or creative tasks would test the generality.

Load-bearing premise

That high-entropy tokens are the causal factors responsible for the reasoning improvements from RLVR, such that updates to them alone capture the essential learning without requiring adjustments to low-entropy tokens.

What would settle it

An experiment where restricting updates to high-entropy tokens results in no improvement or worse performance than full updates on the same benchmarks, while low-entropy restricted training performs equally well.

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that RLVR for LLM reasoning is driven by a small minority (~20%) of high-entropy 'forking tokens' in CoT trajectories that steer reasoning paths. Analysis shows entropy patterns are stable and RLVR primarily adjusts high-entropy tokens. Restricting policy gradients to these tokens matches full updates on Qwen3-8B and exceeds them on larger models (+11.04 AIME'25 and +7.71 AIME'24 on 32B; +4.79 and +5.21 on 14B), while low-entropy-only updates collapse performance. This indicates RLVR efficacy arises from optimizing high-entropy tokens that decide reasoning directions.

Significance. If the empirical results hold, the work provides a useful token-entropy perspective on RLVR mechanisms and a practical route to more efficient training by updating only 20% of tokens while preserving or improving reasoning performance. The reported scaling trend (larger models benefit more from the restriction) is potentially important for future RLVR design. The contrast between high- and low-entropy restrictions offers a falsifiable observation that could guide further mechanistic studies.

major comments (3)
  1. [§5] §5 (restricted-update experiments): the performance deltas (e.g., +11.04 on AIME'25 for Qwen3-32B) are reported without error bars, number of random seeds, or statistical significance tests. This undermines the claim of 'significantly surpassing' full-gradient updates, as the gains could lie within run-to-run variance.
  2. [Method for token selection] Token-selection procedure (described in the method for high-entropy masking): the 20% threshold is used without sensitivity analysis across nearby values (15%, 25%) or justification for its optimality. Because the central 'beyond 80/20' claim rests on this specific cutoff, the result is not yet shown to be robust.
  3. [§3–4] §3–4 (entropy-pattern analysis and forking-token interpretation): the masking experiments establish sufficiency of high-entropy tokens but do not isolate causality. No control (e.g., gradient-norm-matched masking, targeted logit perturbation on high-entropy tokens only, or measurement of reasoning-path divergence attributable solely to those tokens) is presented to rule out the alternative that high-entropy tokens are merely proxies for higher-gradient-variance positions.
minor comments (3)
  1. [§2–3] The entropy formula (including any temperature scaling) should be stated explicitly in §2 or §3 rather than assumed known.
  2. [Figures] Figures depicting entropy evolution during training would benefit from consistent y-axis scaling and legends indicating model size and training step.
  3. [Related work] A brief comparison to prior work on token-level importance or entropy regularization in RLHF/RLVR would help situate the contribution.
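
The reporting asked for in major comment 1 can be sketched as follows: mean and standard deviation across seeds, plus a paired t statistic against the full-gradient baseline. The scores below are placeholders chosen for illustration and are not results from the paper. The function names are hypothetical, and pairing runs by seed is an assumption.

```python
import math
import statistics

def mean_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Placeholder benchmark scores for three seeds; NOT results from the paper.
restricted = [52.0, 55.0, 54.0]  # updates restricted to high-entropy tokens
full = [50.0, 51.0, 52.0]        # full-gradient baseline, same seeds

restricted_mean, restricted_std = mean_std(restricted)
t_stat, dof = paired_t_statistic(restricted, full)
```

Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide. A statistics package such as SciPy (scipy.stats.ttest_rel) could be used for that step.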

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the strengths and limitations of our work. We address each major comment point by point below, indicating where revisions will be made to improve rigor and robustness.

Point-by-point responses
  1. Referee: §5 (restricted-update experiments): the performance deltas (e.g., +11.04 on AIME'25 for Qwen3-32B) are reported without error bars, number of random seeds, or statistical significance tests. This undermines the claim of 'significantly surpassing' full-gradient updates, as the gains could lie within run-to-run variance.

    Authors: We agree that the absence of error bars, seed counts, and statistical tests is a limitation that weakens the strength of the 'significantly surpassing' claim in the current manuscript. The reported deltas are large (particularly the +11.04 and +7.71 points on the 32B model), but without variance estimates they cannot be rigorously distinguished from run-to-run noise. In the revised version we will rerun the restricted-update experiments across at least three random seeds, report mean ± standard deviation, and include paired statistical significance tests (e.g., t-tests) against the full-gradient baseline. This will be added to §5 and the corresponding tables. revision: yes

  2. Referee: Token-selection procedure (described in the method for high-entropy masking): the 20% threshold is used without sensitivity analysis across nearby values (15%, 25%) or justification for its optimality. Because the central 'beyond 80/20' claim rests on this specific cutoff, the result is not yet shown to be robust.

    Authors: The 20% cutoff was selected because it approximately matches the fraction of tokens whose entropy exceeds a natural inflection point in the per-trajectory entropy distribution (see Figure 2 and the entropy histogram in §3). Nevertheless, we acknowledge that a sensitivity study is required to demonstrate robustness. In the revision we will add an ablation varying the threshold from 10% to 30% in 5% increments, reporting both average performance and the fraction of total entropy captured. We will also provide a brief justification based on the cumulative entropy contribution of the top-k tokens, showing that performance plateaus or improves in the 15–25% range while remaining superior to the full-gradient baseline on the larger models. revision: yes

  3. Referee: §3–4 (entropy-pattern analysis and forking-token interpretation): the masking experiments establish sufficiency of high-entropy tokens but do not isolate causality. No control (e.g., gradient-norm-matched masking, targeted logit perturbation on high-entropy tokens only, or measurement of reasoning-path divergence attributable solely to those tokens) is presented to rule out the alternative that high-entropy tokens are merely proxies for higher-gradient-variance positions.

    Authors: The high- versus low-entropy masking contrast does establish that restricting updates to the high-entropy subset is sufficient to match or exceed full-gradient performance while the complementary low-entropy subset collapses, which is difficult to reconcile with a pure gradient-variance proxy story. However, we concede that the experiments do not fully isolate causality from correlated factors such as gradient magnitude or variance. In the revision we will (i) add an explicit discussion of this alternative explanation in §4, (ii) include a gradient-norm-matched random masking control where possible, and (iii) report reasoning-path divergence metrics (e.g., token-level edit distance to the final answer) conditioned on high-entropy updates. These additions will strengthen the causal interpretation without overclaiming the current evidence. revision: partial
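
The threshold ablation promised in response 2 reports the fraction of total entropy captured at each cutoff. A minimal sketch of that quantity, swept from 10% to 30% in 5% steps, is shown below. The entropy values are hypothetical and the function name is illustrative. Percentages are passed as integers to avoid floating-point rounding when counting tokens.

```python
def entropy_captured(entropies, percent):
    """Share of total entropy held by the top `percent`% highest-entropy tokens."""
    k = max(1, len(entropies) * percent // 100)  # number of tokens kept
    top = sorted(entropies, reverse=True)[:k]
    return sum(top) / sum(entropies)

# Hypothetical per-token entropies for a 20-token trace (not from the paper):
# four uncertain positions followed by sixteen near-deterministic ones.
entropies = [2.0, 1.5, 1.0, 0.5] + [0.05] * 16

# Sweep the cutoff from 10% to 30% in 5% increments.
sweep = {p: entropy_captured(entropies, p) for p in (10, 15, 20, 25, 30)}
```

With a distribution this skewed, the captured share rises quickly up to the 20% cutoff and then flattens, which is the kind of inflection the rebuttal appeals to. Whether real traces behave this way is an empirical question that the proposed ablation would have to answer.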

Circularity Check

0 steps flagged

No circularity: empirical ablation results are measured outcomes, not tautological

full rationale

The paper's chain consists of (1) observational measurement of token entropy distributions in CoT traces, (2) tracking how those distributions evolve under RLVR, and (3) an ablation experiment that masks gradients to the top ~20% high-entropy tokens and reports downstream benchmark scores. The performance numbers (+11.04 AIME'25 on 32B, etc.) are externally measured quantities obtained after training; they are not algebraically or statistically forced by the entropy-threshold definition used to select the mask. No equations reduce the final result to the input selection rule, no self-citations carry the central claim, and no fitted parameter is relabeled as a prediction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on empirical patterns in token entropy during RLVR rather than formal axioms or derivations; the 20% cutoff appears chosen based on observed distributions.

free parameters (1)
  • high-entropy token selection threshold (20%)
    Percentage used to isolate the minority of tokens for restricted updates; chosen to match the observed high-entropy fraction.
axioms (1)
  • domain assumption: High-entropy tokens correspond to critical reasoning forks that determine downstream performance
    Invoked when interpreting entropy patterns as causal drivers of RLVR efficacy.
invented entities (1)
  • forking tokens (no independent evidence)
    purpose: Label for high-entropy tokens that steer reasoning pathways
    Introduced to describe the observed minority tokens whose updates drive performance.

pith-pipeline@v0.9.0 · 5712 in / 1465 out tokens · 58316 ms · 2026-05-12T12:06:49.188880+00:00 · methodology

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 7.0

    Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.

  2. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  3. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  4. Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

    cs.LG 2026-05 unverdicted novelty 7.0

    HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.

  5. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  6. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  7. Hölder Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

  8. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

  9. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  10. Epistemic Uncertainty for Test-Time Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.

  11. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  12. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  13. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  14. Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.

  15. When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

    cs.CR 2026-05 unverdicted novelty 6.0

    Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of sus...

  16. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  17. GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

  18. Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

    cs.LG 2026-04 unverdicted novelty 6.0

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  19. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  20. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  21. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  22. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  23. LLMs Should Express Uncertainty Explicitly

    cs.LG 2026-04 unverdicted novelty 6.0

    Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.

  24. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  25. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 5.0

    SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.

  26. How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

    cs.AI 2026-05 unverdicted novelty 5.0

    IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.

  27. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

  28. EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

    cs.CL 2026-05 unverdicted novelty 5.0

    EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.

  29. OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 5.0

    OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

  30. MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

  31. Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis

    cs.LG 2026-04 unverdicted novelty 5.0

    Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization that outperforms baselines.

  32. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
