SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
hub
The unreasonable effectiveness of entropy minimization in llm reasoning
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution diversity on reasoning benchmarks.
Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
-
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution diversity on reasoning benchmarks.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.