Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Andrew Zhao; An Yang; Bowen Yu; Chang Gao; Chujie Zheng; Gao Huang; Jianxin Yang; Junyang Lin; Kai Dang; Le Yu

arxiv: 2506.01939 · v2 · submitted 2025-06-02 · 💻 cs.CL · cs.AI· cs.LG

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang , Le Yu , Chang Gao , Chujie Zheng , Shixuan Liu , Rui Lu , Kai Dang , Xionghui Chen

show 10 more authors

Jianxin Yang Zhenru Zhang Yuqiong Liu An Yang Andrew Zhao Yang Yue Shiji Song Bowen Yu Gao Huang Junyang Lin

This is my paper

Pith reviewed 2026-05-12 12:06 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords reinforcement learningLLM reasoningtoken entropyRLVRchain of thoughtpolicy gradienthigh entropy tokens

0 comments

The pith

High-entropy forking tokens steer LLM reasoning in RLVR, so updates on just 20% of tokens match or beat full training on larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines token entropy during reinforcement learning with verifiable rewards for improving LLM reasoning abilities. It identifies that high-entropy tokens serve as critical branching points in chain-of-thought sequences. Analysis shows RLVR training mainly modifies these high-entropy tokens rather than low-entropy ones. By limiting policy updates to these minority tokens, the approach achieves performance on par with full updates for smaller models and superior results for larger ones on math reasoning tasks. This indicates that the benefits of RLVR stem from optimizing decision points in reasoning paths.

Core claim

In RLVR for LLM reasoning, only a small fraction of tokens exhibit high entropy and act as forking points that determine reasoning pathways. RLVR primarily adjusts the entropy of these high-entropy tokens. Restricting policy gradient updates exclusively to these forking tokens enables utilization of only 20% of tokens while maintaining comparable performance to full-gradient updates on Qwen3-8B and surpassing it on Qwen3-14B and Qwen3-32B models on AIME benchmarks, with gains up to 11 points.

What carries the argument

High-entropy forking tokens, the minority set of tokens with elevated uncertainty in CoT reasoning that steer the model toward different reasoning pathways; restricting policy gradient updates to them focuses the learning on these critical decision points.

If this is right

Performance on reasoning benchmarks can be preserved or enhanced by updating gradients for only the top 20% highest-entropy tokens.
RLVR training becomes more efficient as it ignores the majority of low-entropy tokens that do not influence reasoning directions.
Larger models show greater benefits from this selective update strategy, suggesting improved scalability.
Training exclusively on low-entropy tokens results in degraded reasoning performance.
The mechanism of RLVR is revealed as primarily entropy adjustment at reasoning forks rather than broad distribution changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Methods to dynamically identify high-entropy tokens during inference could further optimize training.
This selective update approach might apply to other RL settings in language models beyond reasoning tasks.
If high-entropy tokens are the key, it could simplify the design of reward models or verification in RLVR.
Exploring whether these patterns hold in non-reasoning domains like coding or creative tasks would test the generality.

Load-bearing premise

That high-entropy tokens are the causal factors responsible for the reasoning improvements from RLVR, such that updates to them alone capture the essential learning without requiring adjustments to low-entropy tokens.

What would settle it

An experiment where restricting updates to high-entropy tokens results in no improvement or worse performance than full updates on the same benchmarks, while low-entropy restricted training performs equally well.

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Restricting RLVR to the top 20% highest-entropy tokens matches or beats full updates on 14B and 32B models, but the causal story for why those tokens steer reasoning stays assumptive.

read the letter

The main point is that this paper shows a practical efficiency result: doing RLVR updates only on the 20% highest-entropy tokens keeps performance on par with full gradients for Qwen3-8B and improves it on the 14B and 32B versions, with gains of 4-11 points on AIME'24 and AIME'25. The low-entropy-only control collapses, which is a clean contrast. That scaling trend with model size is the clearest new signal here. The entropy analysis itself is straightforward observational work that tracks how patterns hold steady from the base model through training and that RL mostly moves the high-entropy subset. The experiment then turns that observation into a restricted-update method that works better than expected on bigger models. This angle on token entropy as a lens for RLVR is not something earlier papers emphasized with the same direct test. The soft spot is the jump from correlation to mechanism. The authors label the high-entropy tokens as forking points that decide reasoning paths, yet the evidence is only that the subset succeeds and the complement fails. No perturbation test or path-divergence measurement isolates whether those specific tokens are the causal drivers rather than simply the places with higher gradient signal. The 20% cutoff also looks chosen for the reported win without much sensitivity data shown in the abstract. Readers working on cheaper RL for reasoning or on what RLVR actually changes inside the model will find the numbers useful. The empirical result is concrete enough to merit a serious referee, though the interpretation will need tighter controls before it can be taken as settled.

Referee Report

3 major / 3 minor

Summary. The paper claims that RLVR for LLM reasoning is driven by a small minority (~20%) of high-entropy 'forking tokens' in CoT trajectories that steer reasoning paths. Analysis shows entropy patterns are stable and RLVR primarily adjusts high-entropy tokens. Restricting policy gradients to these tokens matches full updates on Qwen3-8B and exceeds them on larger models (+11.04 AIME'25 and +7.71 AIME'24 on 32B; +4.79 and +5.21 on 14B), while low-entropy-only updates collapse performance. This indicates RLVR efficacy arises from optimizing high-entropy tokens that decide reasoning directions.

Significance. If the empirical results hold, the work provides a useful token-entropy perspective on RLVR mechanisms and a practical route to more efficient training by updating only 20% of tokens while preserving or improving reasoning performance. The reported scaling trend (larger models benefit more from the restriction) is potentially important for future RLVR design. The contrast between high- and low-entropy restrictions offers a falsifiable observation that could guide further mechanistic studies.

major comments (3)

[§5] §5 (restricted-update experiments): the performance deltas (e.g., +11.04 on AIME'25 for Qwen3-32B) are reported without error bars, number of random seeds, or statistical significance tests. This undermines the claim of 'significantly surpassing' full-gradient updates, as the gains could lie within run-to-run variance.
[Method for token selection] Token-selection procedure (described in the method for high-entropy masking): the 20% threshold is used without sensitivity analysis across nearby values (15%, 25%) or justification for its optimality. Because the central 'beyond 80/20' claim rests on this specific cutoff, the result is not yet shown to be robust.
[§3–4] §3–4 (entropy-pattern analysis and forking-token interpretation): the masking experiments establish sufficiency of high-entropy tokens but do not isolate causality. No control (e.g., gradient-norm-matched masking, targeted logit perturbation on high-entropy tokens only, or measurement of reasoning-path divergence attributable solely to those tokens) is presented to rule out the alternative that high-entropy tokens are merely proxies for higher-gradient-variance positions.

minor comments (3)

[§2–3] The entropy formula (including any temperature scaling) should be stated explicitly in §2 or §3 rather than assumed known.
[Figures] Figures depicting entropy evolution during training would benefit from consistent y-axis scaling and legends indicating model size and training step.
[Related work] A brief comparison to prior work on token-level importance or entropy regularization in RLHF/RLVR would help situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which help clarify the strengths and limitations of our work. We address each major comment point by point below, indicating where revisions will be made to improve rigor and robustness.

read point-by-point responses

Referee: §5 (restricted-update experiments): the performance deltas (e.g., +11.04 on AIME'25 for Qwen3-32B) are reported without error bars, number of random seeds, or statistical significance tests. This undermines the claim of 'significantly surpassing' full-gradient updates, as the gains could lie within run-to-run variance.

Authors: We agree that the absence of error bars, seed counts, and statistical tests is a limitation that weakens the strength of the 'significantly surpassing' claim in the current manuscript. The reported deltas are large (particularly the +11.04 and +7.71 points on the 32B model), but without variance estimates they cannot be rigorously distinguished from run-to-run noise. In the revised version we will rerun the restricted-update experiments across at least three random seeds, report mean ± standard deviation, and include paired statistical significance tests (e.g., t-tests) against the full-gradient baseline. This will be added to §5 and the corresponding tables. revision: yes
Referee: Token-selection procedure (described in the method for high-entropy masking): the 20% threshold is used without sensitivity analysis across nearby values (15%, 25%) or justification for its optimality. Because the central 'beyond 80/20' claim rests on this specific cutoff, the result is not yet shown to be robust.

Authors: The 20% cutoff was selected because it approximately matches the fraction of tokens whose entropy exceeds a natural inflection point in the per-trajectory entropy distribution (see Figure 2 and the entropy histogram in §3). Nevertheless, we acknowledge that a sensitivity study is required to demonstrate robustness. In the revision we will add an ablation varying the threshold from 10% to 30% in 5% increments, reporting both average performance and the fraction of total entropy captured. We will also provide a brief justification based on the cumulative entropy contribution of the top-k tokens, showing that performance plateaus or improves in the 15–25% range while remaining superior to the full-gradient baseline on the larger models. revision: yes
Referee: §3–4 (entropy-pattern analysis and forking-token interpretation): the masking experiments establish sufficiency of high-entropy tokens but do not isolate causality. No control (e.g., gradient-norm-matched masking, targeted logit perturbation on high-entropy tokens only, or measurement of reasoning-path divergence attributable solely to those tokens) is presented to rule out the alternative that high-entropy tokens are merely proxies for higher-gradient-variance positions.

Authors: The high- versus low-entropy masking contrast does establish that restricting updates to the high-entropy subset is sufficient to match or exceed full-gradient performance while the complementary low-entropy subset collapses, which is difficult to reconcile with a pure gradient-variance proxy story. However, we concede that the experiments do not fully isolate causality from correlated factors such as gradient magnitude or variance. In the revision we will (i) add an explicit discussion of this alternative explanation in §4, (ii) include a gradient-norm-matched random masking control where possible, and (iii) report reasoning-path divergence metrics (e.g., token-level edit distance to the final answer) conditioned on high-entropy updates. These additions will strengthen the causal interpretation without overclaiming the current evidence. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical ablation results are measured outcomes, not tautological

full rationale

The paper's chain consists of (1) observational measurement of token entropy distributions in CoT traces, (2) tracking how those distributions evolve under RLVR, and (3) an ablation experiment that masks gradients to the top ~20% high-entropy tokens and reports downstream benchmark scores. The performance numbers (+11.04 AIME'25 on 32B, etc.) are externally measured quantities obtained after training; they are not algebraically or statistically forced by the entropy-threshold definition used to select the mask. No equations reduce the final result to the input selection rule, no self-citations carry the central claim, and no fitted parameter is relabeled as a prediction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on empirical patterns in token entropy during RLVR rather than formal axioms or derivations; the 20% cutoff appears chosen based on observed distributions.

free parameters (1)

high-entropy token selection threshold (20%)
Percentage used to isolate the minority of tokens for restricted updates; chosen to match the observed high-entropy fraction.

axioms (1)

domain assumption High-entropy tokens correspond to critical reasoning forks that determine downstream performance
Invoked when interpreting entropy patterns as causal drivers of RLVR efficacy.

invented entities (1)

forking tokens no independent evidence
purpose: Label for high-entropy tokens that steer reasoning pathways
Introduced to describe the observed minority tokens whose updates drive performance.

pith-pipeline@v0.9.0 · 5712 in / 1465 out tokens · 58316 ms · 2026-05-12T12:06:49.188880+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.HierarchyForcing additive_composition_is_minimal echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates... training exclusively on the 80% lowest-entropy tokens leads to a marked decline

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Visual-Advantage On-Policy Distillation for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
cs.LG 2026-05 unverdicted novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
cs.CL 2026-05 unverdicted novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic onlin...
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking
cs.LG 2026-05 unverdicted novelty 7.0

PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 7.0

Persistent 'Rock Tokens' in on-policy distillation resist teacher corrections, consume large gradient norms, yet add negligible value to reasoning, allowing targeted bypassing to streamline alignment.
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR
cs.LG 2026-05 unverdicted novelty 7.0

HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 7.0

RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
cs.CL 2026-03 unverdicted novelty 7.0

The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate
cs.CL 2026-01 unverdicted novelty 7.0

SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
cs.LG 2026-01 unverdicted novelty 7.0

HE-SNR is a high-entropy signal-to-noise ratio metric derived from the Entropy Compression Hypothesis to better guide LLM mid-training on complex software engineering benchmarks.
Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
cs.LG 2025-07 unverdicted novelty 7.0

Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
Unified Data Selection for LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math ...
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
Post-Trained MoE Can Skip Half Experts via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
cs.LG 2026-05 conditional novelty 6.0

Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.
Reasoning Can Be Restored by Correcting a Few Decision Tokens
cs.AI 2026-05 conditional novelty 6.0

Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Holder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.
Holder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 6.0

SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
cs.LG 2026-05 unverdicted novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
Epistemic Uncertainty for Test-Time Discovery
cs.LG 2026-05 unverdicted novelty 6.0

UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...
AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
cs.LG 2026-05 unverdicted novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
cs.CL 2026-05 unverdicted novelty 6.0

RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation
cs.CL 2026-05 unverdicted novelty 6.0

DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
cs.CR 2026-05 unverdicted novelty 6.0

Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of sus...
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0

Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG 2026-04 unverdicted novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
cs.LG 2026-04 unverdicted novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
cs.AI 2026-04 unverdicted novelty 6.0

HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
LLMs Should Express Uncertainty Explicitly
cs.LG 2026-04 unverdicted novelty 6.0

Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.
LLMs Should Express Uncertainty Explicitly
cs.LG 2026-04 unverdicted novelty 6.0

Training LLMs to verbalize uncertainty explicitly at the end or during reasoning reduces overconfident errors and improves answer quality on factual tasks while enabling RAG triggers.
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
cs.LG 2026-03 unverdicted novelty 6.0

EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
cs.CL 2026-02 unverdicted novelty 6.0

STAPO stabilizes RL for LLMs by suppressing gradient updates from rare spurious tokens, yielding 11.49% average gains on math benchmarks over GRPO and similar baselines.
Training-Trajectory-Aware Token Selection
cs.CL 2026-01 unverdicted novelty 6.0

Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
cs.CV 2025-12 unverdicted novelty 6.0

High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.
Boosting Reasoning in Large Multimodal Models via Activation Replay
cs.CV 2025-11 unverdicted novelty 6.0

Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
cs.AI 2025-10 unverdicted novelty 6.0

UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.
Entropy After </Think> for reasoning model early exiting
cs.LG 2025-09 unverdicted novelty 6.0

Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.
GIFT: Guided Importance-Aware Fine-Tuning for Diffusion Language Models
cs.CL 2025-09 unverdicted novelty 6.0

GIFT weights tokens by entropy during fine-tuning of diffusion language models and reports better performance than standard SFT on reasoning benchmarks across multiple settings.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
cs.CL 2025-07 unverdicted novelty 6.0

Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 an...
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
cs.CL 2025-06 unverdicted novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI 2026-05 unverdicted novelty 5.0

SORT converts all-failed reasoning prompts into selective, structure-aware training signals by weighting tokens according to how much a reference-derived plan increases their probability.
How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
cs.AI 2026-05 unverdicted novelty 5.0

IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
cs.AI 2026-05 unverdicted novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer
cs.CL 2026-05 unverdicted novelty 5.0

EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.
Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis
cs.LG 2026-04 unverdicted novelty 5.0

Token credit in RLVR is upper-bounded by entropy, with reasoning gains concentrated in high-entropy tokens, motivating Entropy-Aware Policy Optimization that outperforms baselines.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 59 Pith papers · 23 internal anchors

[1]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

[Accessed 01-05-2025]. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv: 2505.11831,

work page internal anchor Pith review arXiv 2025
[2]

On the Measure of Intelligence

François Chollet. On the measure of intelligence.arXiv preprint arXiv: 1911.01547,

work page internal anchor Pith review arXiv 1911
[3]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V . Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161,

work page internal anchor Pith review arXiv
[4]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv
[5]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

work page internal anchor Pith review arXiv
[6]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307,

work page internal anchor Pith review arXiv
[7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519,

work page internal anchor Pith review arXiv
[9]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143,

work page internal anchor Pith review arXiv
[10]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290,

work page internal anchor Pith review arXiv
[11]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

17 Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V . Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Llms can easily learn to reason from demonstrations structure, not content, is what matters!arXiv preprint arXiv:2502.07374, 2025

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G Patil, Matei Zaharia, et al. Llms can easily learn to reason from demonstrations structure, not content, is what matters!arXiv preprint arXiv:2502.07374,

work page arXiv
[14]

arXiv preprint arXiv:2411.19943 , year =

Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, and Zhaopeng Tu. Critical tokens matter: Token-level contrastive estimation enhence llm’s reasoning capability.arXiv preprint arXiv:2411.19943,

work page arXiv
[15]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

[Ac- cessed 01-05-2025]. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744,

work page 2025
[17]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Understanding the performance gap between online and offline alignment algorithms, 2024

Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, et al. Understanding the performance gap between online and offline alignment algorithms.arXiv preprint arXiv:2405.08448,

work page arXiv
[21]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Vassoyan, J., Beau, N., and Plaud, R

URL https://qwenlm.github. io/blog/qwq-32b/. Jean Vassoyan, Nathanaël Beau, and Roman Plaud. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning.arXiv preprint arXiv:2502.06533,

work page arXiv
[23]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example.arXiv preprint arXiv:2504.20571,

work page internal anchor Pith review arXiv
[24]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv
[26]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025a. Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xian...

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335,

work page internal anchor Pith review arXiv

[1] [1]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

[Accessed 01-05-2025]. Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv: 2505.11831,

work page internal anchor Pith review arXiv 2025

[2] [2]

On the Measure of Intelligence

François Chollet. On the measure of intelligence.arXiv preprint arXiv: 1911.01547,

work page internal anchor Pith review arXiv 1911

[3] [3]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V . Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161,

work page internal anchor Pith review arXiv

[4] [4]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

work page internal anchor Pith review arXiv

[6] [6]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307,

work page internal anchor Pith review arXiv

[7] [7]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519,

work page internal anchor Pith review arXiv

[9] [9]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143,

work page internal anchor Pith review arXiv

[10] [10]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model.arXiv preprint arXiv:2503.24290,

work page internal anchor Pith review arXiv

[11] [11]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

17 Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V . Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Llms can easily learn to reason from demonstrations structure, not content, is what matters!arXiv preprint arXiv:2502.07374, 2025

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G Patil, Matei Zaharia, et al. Llms can easily learn to reason from demonstrations structure, not content, is what matters!arXiv preprint arXiv:2502.07374,

work page arXiv

[14] [14]

arXiv preprint arXiv:2411.19943 , year =

Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, and Zhaopeng Tu. Critical tokens matter: Token-level contrastive estimation enhence llm’s reasoning capability.arXiv preprint arXiv:2411.19943,

work page arXiv

[15] [15]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

[Ac- cessed 01-05-2025]. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744,

work page 2025

[17] [17]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Understanding the performance gap between online and offline alignment algorithms, 2024

Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, et al. Understanding the performance gap between online and offline alignment algorithms.arXiv preprint arXiv:2405.08448,

work page arXiv

[21] [21]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Vassoyan, J., Beau, N., and Plaud, R

URL https://qwenlm.github. io/blog/qwq-32b/. Jean Vassoyan, Nathanaël Beau, and Roman Plaud. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning.arXiv preprint arXiv:2502.06533,

work page arXiv

[23] [23]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example.arXiv preprint arXiv:2504.20571,

work page internal anchor Pith review arXiv

[24] [24]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, and Ping Luo. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025a. Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xian...

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data.arXiv preprint arXiv:2505.03335,

work page internal anchor Pith review arXiv