A verifiable empirical win rate reward combined with gradient masking enables RL training of a 7B model to reach betting-market calibration on NFL win probabilities using only outcome data.
hub Canonical reference
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.
ARCA assigns token credit in LoRA-based LLM RL from the norm of adapter-induced hidden state changes, yielding non-degenerate distributions and competitive performance on MATH tasks with Qwen3-1.7B under GRPO.
Temporal scheduling of credit allocation criteria over RLVR training, using trajectory percentiles to target heterogeneous behaviors, yields more stable policy entropy and better reasoning benchmark results than static allocation.
VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interventions that enhance performance.
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
Embedding-based defenses fail against crafted attacks in LLM MAS; confidence scores from logits improve robustness but decay over communication rounds.
The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.
SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
The L2 norm of LLM hidden states signals reasoning intensity, with a theoretical bound on SAE feature activations, enabling three new test-time scaling techniques that boost performance.
TOPD improves on-policy distillation for LLM reasoning by using near-future guidance to identify divergent states, raising average accuracy from 47.8% to 52.2% on math benchmarks including AIME24 and AIME25.
Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
CIE-Scorer detects unfaithful CoT by tracing compact sentence-level circuits, building internal-external reasoning graphs, and scoring their discrepancy with Fused Gromov-Wasserstein distance, reporting SOTA results on FaithCoT-Bench with reduced circuit cost.
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
citing papers explorer
-
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.