hub Canonical reference

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng · 2025 · cs.LG · arXiv 2504.20571

Canonical reference. 71% of citing Pith papers cite this work as background.

39 Pith papers citing it

Background 71% of classified citations

open full Pith review browse 39 citing papers arXiv PDF

abstract

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (8.6% improvement beyond format correction), and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (7.0% non-format gain). This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. All resources are open source at https://github.com/ypwang61/One-Shot-RLVR.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 1 method 1

citation-polarity summary

background 5 use dataset 1 use method 1

representative citing papers

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

cs.LG · 2026-02-03 · unverdicted · novelty 7.0

Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

cs.AI · 2026-02-02 · unverdicted · novelty 7.0

GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

cs.LG · 2026-01-28 · unverdicted · novelty 7.0

Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.

Learning to Discover at Test Time

cs.LG · 2026-01-22 · unverdicted · novelty 7.0

TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.

Reasoning Can Be Restored by Correcting a Few Decision Tokens

cs.AI · 2026-05-16 · conditional · novelty 6.0

Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.

Gradient Extrapolation-Based Policy Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

cs.DB · 2026-04-17 · unverdicted · novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

cs.CL · 2026-01-29 · unverdicted · novelty 6.0

CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.

Differentiable Evolutionary Reinforcement Learning

cs.AI · 2025-12-15 · unverdicted · novelty 6.0

DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

cs.AI · 2025-10-12 · unverdicted · novelty 6.0

UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

cs.AI · 2025-09-29 · unverdicted · novelty 6.0

DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.

RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

cs.AI · 2025-07-31 · unverdicted · novelty 6.0

RL-PLUS is a hybrid RL approach for LLMs that combines internal exploitation with external data via importance sampling and exploration advantages to prevent capability boundary collapse and achieve gains on math and OOD reasoning benchmarks.

citing papers explorer

Showing 39 of 39 citing papers.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 58 · internal anchor
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards cs.LG · 2026-05-03 · unverdicted · none · ref 5 · internal anchor
SGAC replaces reward-variance heuristics with a multi-feature learnable selector emphasizing output entropy, yielding 68% accuracy on Hendrycks MATH with Qwen2.5-Math-1.5B versus 64-66% baselines.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 124 · internal anchor
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing cs.LG · 2026-02-03 · unverdicted · none · ref 22 · internal anchor
Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models cs.AI · 2026-02-02 · unverdicted · none · ref 34 · internal anchor
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning cs.LG · 2026-01-28 · unverdicted · none · ref 19 · internal anchor
Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.
Learning to Discover at Test Time cs.LG · 2026-01-22 · unverdicted · none · ref 79 · internal anchor
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning cs.AI · 2026-05-21 · unverdicted · none · ref 35 · internal anchor
Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.
Reasoning Can Be Restored by Correcting a Few Decision Tokens cs.AI · 2026-05-16 · conditional · none · ref 23 · internal anchor
Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting cs.CL · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
Covariance-weighted GRPO with Gaussian-kernel reweighting tames extreme tokens to stabilize training and boost reasoning performance over standard GRPO.
Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs cs.AI · 2026-05-09 · unverdicted · none · ref 25 · internal anchor
OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control cs.LG · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 93 · internal anchor
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
Gradient Extrapolation-Based Policy Optimization cs.LG · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning cs.CL · 2026-05-07 · unverdicted · none · ref 22 · 2 links · internal anchor
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.
Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning cs.LG · 2026-04-30 · unverdicted · none · ref 24 · 2 links · internal anchor
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning cs.CV · 2026-04-21 · unverdicted · none · ref 64 · internal anchor
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 59 · internal anchor
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation cs.DB · 2026-04-17 · unverdicted · none · ref 85 · internal anchor
EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation cs.CL · 2026-01-29 · unverdicted · none · ref 24 · internal anchor
CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
Differentiable Evolutionary Reinforcement Learning cs.AI · 2025-12-15 · unverdicted · none · ref 24 · internal anchor
DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning cs.AI · 2025-10-12 · unverdicted · none · ref 16 · internal anchor
UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search cs.AI · 2025-09-29 · unverdicted · none · ref 23 · internal anchor
DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization cs.AI · 2025-07-31 · unverdicted · none · ref 17 · internal anchor
RL-PLUS is a hybrid RL approach for LLMs that combines internal exploitation with external data via importance sampling and exploration advantages to prevent capability boundary collapse and achieve gains on math and OOD reasoning benchmarks.
Generalizing Verifiable Instruction Following cs.CL · 2025-07-03 · unverdicted · none · ref 29 · internal anchor
Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 37 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning cs.CL · 2025-06-02 · conditional · none · ref 23 · internal anchor
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning cs.LG · 2025-05-21 · unverdicted · none · ref 88 · internal anchor
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
On the Generalization Gap in Self-Evolving Language Model Reasoning cs.CL · 2026-05-31 · unverdicted · none · ref 36 · internal anchor
Closed-loop self-evolution on LLMs improves reasoning on Knights and Knaves tasks but plateaus short of oracle-supervised levels, with multi-turn revision nearly matching it for large models.
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs cs.AI · 2026-05-27 · unverdicted · none · ref 39 · internal anchor
Sample difficulty in RLVR shows non-monotonic effects on LLM reasoning, with easy/medium problems strengthening computation and reasoning features while hard problems often yield weak or harmful signals.
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering cs.CV · 2026-04-09 · unverdicted · none · ref 39 · internal anchor
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.
LLM Reasoning with Process Rewards for Outcome-Guided Steps cs.LG · 2026-02-08 · unverdicted · none · ref 5 · internal anchor
PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.
The Serial Scaling Hypothesis cs.LG · 2025-07-16 · unverdicted · none · ref 123 · internal anchor
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
Hierarchical Reasoning Model cs.AI · 2025-06-26 · unverdicted · none · ref 99 · internal anchor
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.
Sharpness-Guided Group Relative Policy Optimization via Probability Shaping cs.LG · 2025-10-29 · unverdicted · none · ref 35 · internal anchor
GRPO-SG is a sharpness-guided token-weighted variant of GRPO that downweights high-gradient tokens to stabilize optimization and improve generalization in reinforcement learning with verifiable rewards.
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems cs.LG · 2026-05-08 · unverdicted · none · ref 153 · internal anchor
Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems cs.AI · 2025-03-31 · unverdicted · none · ref 165 · internal anchor
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning cs.LG · 2026-05-18 · unreviewed · ref 25 · internal anchor
Cost-Aware Learning cs.LG · 2026-04-30 · unreviewed · ref 17 · internal anchor

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer