GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
56 Pith papers cite this work. Polarity classification is still indexing.
abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
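For readers unfamiliar with the metric, pass@k is usually computed with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k) averaged over problems. The sketch below is a minimal illustration of that estimator, not code released with this paper; the function name and the toy sample counts are made up.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n sampled completions,
    of which c passed the verifier. Equals 1 - C(n-c, k) / C(n, k),
    computed in a numerically stable product form."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Toy numbers (hypothetical): a model that is rarely right per sample
# can still reach high coverage once k is large.
n, c = 256, 8                          # 256 samples, 8 correct
print(round(pass_at_k(n, c, 1), 3))    # 0.031  (pass@1)
print(round(pass_at_k(n, c, 128), 3))  # 0.997  (pass@128)
```

Roughly speaking, pass@1 reflects how often the model's default sample is right, while pass@k at large k probes whether a correct solution exists anywhere in the model's sampling distribution, which is why the paper uses large k to measure reasoning-capability boundaries.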
co-cited works
roles
dataset 1
polarities
use dataset 1
representative citing papers
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
The SeePhys Pro benchmark reveals that multimodal models degrade on physics reasoning as information shifts from text to images, with blind-training improvements often stemming from textual cues rather than visual evidence.
HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks (e.g., 36.67% on AIME 2025).
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench for Text-to-SVG and Image-to-SVG.
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
GeoSkill lets vision-language models improve geolocation accuracy and reasoning by maintaining an evolving Skill-Graph that grows through autonomous analysis of successful and failed rollouts on web-scale image data.
SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR (see the worked equations after this list), with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
GFT uses group advantage learning and dynamic coefficient rectification to fix reward sparsity and optimization instability in SFT for LLMs, yielding better policies than standard SFT.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, then drops the agents at inference.
Group-mean centering in binary-reward GRPO produces gradient starvation; the fixed-sign advantage A = 2r - 1 raises GSM8K accuracy from 28.4% to 73.8% at group size 4 (see the numerical sketch after this list).
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magnitude lower cost.
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
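The weighted-SFT item in the list above (reference-sampled SFT with Boltzmann weights vs. KL-regularized RLVR) states an equivalence that can be written compactly in standard KL-regularized-RL notation. The display below is a generic statement of that equivalence, not notation or derivation taken from that paper; pi_ref, r, beta, and Z are placeholder symbols, and BOLT's finite-sample estimator and error decomposition are not reproduced here.

```latex
\[
\max_{\pi}\;\mathbb{E}_{y\sim\pi(\cdot\mid x)}\!\left[r(x,y)\right]
  \;-\;\beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\middle\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)
\;\;\Longrightarrow\;\;
\pi^{*}(y\mid x)=\frac{\pi_{\mathrm{ref}}(y\mid x)\,\exp\!\left(r(x,y)/\beta\right)}{Z(x)},
\quad
Z(x)=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\left[\exp\!\left(r(x,y)/\beta\right)\right].
\]
Weighted SFT on reference samples with prompt-normalized Boltzmann weights
$w(x,y)=\exp\!\left(r(x,y)/\beta\right)/Z(x)$ optimizes
\[
\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\!\left[w(x,y)\,\log\pi(y\mid x)\right]
  \;=\;\mathbb{E}_{y\sim\pi^{*}(\cdot\mid x)}\!\left[\log\pi(y\mid x)\right],
\]
which is a cross-entropy against $\pi^{*}$ and is therefore maximized at $\pi=\pi^{*}$.
```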
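The gradient-starvation item in the list above (group-mean centering vs. the fixed-sign advantage A = 2r - 1) can be illustrated numerically. This is a minimal sketch under the usual binary-reward GRPO setup; the function names and the toy reward group are made up, and the clipping and policy-gradient machinery around the advantage is omitted.

```python
import numpy as np

def group_mean_advantage(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: subtract the group mean. With binary
    rewards, an all-correct or all-wrong group yields zero advantage
    for every rollout, so those prompts contribute no gradient."""
    return rewards - rewards.mean()

def fixed_sign_advantage(rewards: np.ndarray) -> np.ndarray:
    """Fixed-sign advantage A = 2r - 1: correct rollouts always get +1
    and incorrect rollouts always get -1, regardless of the group."""
    return 2.0 * rewards - 1.0

group = np.array([0.0, 0.0, 0.0, 0.0])    # group of 4 rollouts, all wrong
print(group_mean_advantage(group))        # [0. 0. 0. 0.] -> no learning signal
print(fixed_sign_advantage(group))        # [-1. -1. -1. -1.] -> signal preserved
```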