hub Canonical reference

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang · 2025 · cs.CL · arXiv 2509.08827

Canonical reference. 100% of citing Pith papers cite this work as background.

39 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 39 citing papers arXiv PDF

abstract

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

cs.LG · 2026-03-13 · unverdicted · novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

cs.RO · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

SCORP delivers 10-28% gains in safety and 2-7% in efficiency metrics on WOMD by using dual-path scene conditioning in diffusion planning plus variance-gated group-relative policy optimization for closed-loop stability.

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

S$^2$GR: Stepwise Semantic-Guided Reasoning in Latent Space for Generative Recommendation

cs.IR · 2026-01-26 · unverdicted · novelty 7.0

S²GR adds stepwise thinking tokens with contrastive supervision on codebook clusters to balance computational focus and ground reasoning paths in generative recommendation.

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

CRPO extends group relative policy optimization with stage-dependent uncertainty modeling and reports a 10.4 percentage point weighted F1 gain over RL baselines across 8 mental health datasets.

FGRPO: Federated GRPO with Adaptive Aggregation on Non-IID Data

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

FGRPO decentralizes GRPO fine-tuning via adaptive aggregation based on relative performance gain to achieve robust convergence on non-IID data while preserving privacy.

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, including in never-smokers.

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

ADR generates novel verifiable code tasks via atomic decomposition and recombination, outperforming heuristic baselines in originality, difficulty, and downstream RLVR gains across coding domains.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetry between high- and low-probability tokens.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

cs.LG · 2026-01-25 · unverdicted · novelty 6.0

TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

cs.CL · 2025-12-08 · unverdicted · novelty 6.0

NPR trains LLMs to reason in parallel via self-distilled RL, delivering up to 24.5% performance gains and 4.6x speedups with 100% genuine parallel execution on reasoning benchmarks.

When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL

cs.CL · 2025-10-07 · unverdicted · novelty 6.0

The paper identifies that importance sampling ratios in outcome-supervised RL misallocate credit by creating unbalanced token updates, and introduces ASPO to correct the asymmetry for positive-advantage tokens.

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

IAPO is an RL method that aligns model input attributions with a teacher to improve tool-calling in multimodal SLMs, reporting 3% average VQA accuracy gains on Qwen2.5-VL-3B across six tests.

When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff

cs.LG · 2026-06-07 · unverdicted · novelty 5.0

Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 5.0

GeoMin uses geometric distribution modeling on labeled data to assess self-reward reliability, enabling better performance in semi-supervised RLVR with only 10% of typical annotations.

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

POPO uses recency-based prioritized group replay and decoupled off-policy optimization to avoid zero-variance ineffective samples in RLVR, accelerating LLM reasoning finetuning with fewer rollouts.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR cs.LG · 2026-05-19 · conditional · none · ref 41 · internal anchor
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

A Survey of Reinforcement Learning for Large Reasoning Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer