Learning to Reason without External Rewards
Pith reviewed 2026-05-15 21:13 UTC · model grok-4.3
The pith
Large language models can improve at reasoning by using only their own internal confidence as the reward signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using the model's self-certainty as the reward signal in place of external verification, Intuitor trains large language models on reasoning tasks in a fully unsupervised manner. On mathematical reasoning benchmarks the method matches the performance of Group Relative Policy Optimization trained with gold solutions, while generalizing better to out-of-domain tasks such as code generation.
What carries the argument
Intuitor, which substitutes the model's self-certainty score for external rewards inside the Group Relative Policy Optimization algorithm.
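A minimal sketch of that substitution, assuming self-certainty is scored as the mean KL divergence of the model's next-token distributions from the uniform distribution over the vocabulary (the exact formula is what the referee's first major comment asks for); the advantage computation is standard GRPO group-relative normalization:

```python
import numpy as np

def self_certainty(token_logits: np.ndarray) -> float:
    """Sequence-level self-certainty from per-token logits of shape [T, V].

    Assumed form: mean KL(uniform || p_t) over output positions t, so the
    score grows as the next-token distributions become more peaked.
    """
    V = token_logits.shape[-1]
    log_p = token_logits - np.logaddexp.reduce(token_logits, axis=-1, keepdims=True)
    kl_from_uniform = -np.log(V) - log_p.mean(axis=-1)  # KL(U || p_t) per position
    return float(kl_from_uniform.mean())

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: standardize rewards across the responses sampled for one prompt.

    Intuitor's only change is what goes into `rewards`: self-certainty scores
    replace verifier outcomes.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```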
If this is right
- Reasoning training becomes possible in domains where gold answers or test cases cannot be obtained.
- Models show stronger transfer to new task types than those trained with external reward signals.
- Autonomous improvement cycles can run without repeated collection of domain-specific verification data.
- The same internal signal supports learning across multiple problem areas without additional labeling effort.
Where Pith is reading between the lines
- Repeated application of the same loop could allow models to keep improving their own outputs over multiple rounds without outside input.
- The approach may extend to tasks outside language, such as planning or decision sequences, if an analogous internal certainty measure exists.
- Safeguards against overconfidence may become necessary to prevent the model from reinforcing its own mistakes at scale.
Load-bearing premise
The model's self-certainty score accurately reflects the correctness of its reasoning rather than rewarding fluent but wrong outputs.
What would settle it
A controlled test of whether, after Intuitor training, the model produces more incorrect answers on a new benchmark than before training while reporting higher self-certainty on those errors.
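A minimal sketch of such an audit, assuming per-example self-certainty scores and 0/1 correctness labels on a labeled held-out benchmark (the function and field names here are illustrative, not taken from the paper):

```python
import numpy as np

def certainty_vs_correctness(certainty: np.ndarray, correct: np.ndarray) -> dict:
    """Audit whether self-certainty tracks correctness after Intuitor training.

    certainty: per-example self-certainty scores on a labeled benchmark
    correct:   per-example 0/1 correctness labels

    The load-bearing premise predicts a positive correlation and accuracy
    rising across certainty quartiles; the refuting pattern is high
    certainty concentrated on wrong answers.
    """
    corr = float(np.corrcoef(certainty, correct)[0, 1])  # point-biserial correlation
    edges = np.quantile(certainty, [0.25, 0.5, 0.75])
    bucket = np.digitize(certainty, edges)               # 0 = lowest certainty quartile
    acc_by_quartile = [float(correct[bucket == q].mean()) for q in range(4)]
    return {"correlation": corr, "accuracy_by_certainty_quartile": acc_by_quartile}
```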
read the original abstract
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Intuitor, a Reinforcement Learning from Internal Feedback (RLIF) method that substitutes a model's self-certainty score for external verifiable rewards inside the Group Relative Policy Optimization (GRPO) algorithm. It claims this enables fully unsupervised training of LLMs on reasoning tasks, with experiments showing Intuitor matching GRPO performance on mathematical benchmarks while generalizing better to out-of-domain code generation, all without gold solutions or test cases.
Significance. If the central claim holds, the result is significant for the field because it provides evidence that intrinsic model signals can drive policy improvement in RLVR-style training without domain-specific supervision. The open-source code release supports reproducibility and allows direct inspection of the self-certainty implementation.
major comments (3)
- [§3.2] §3.2 (Self-Certainty Reward): The computation of the self-certainty score is described at a high level but lacks the precise formula or pseudocode needed to verify it is not simply a monotonic function of sequence length or token entropy; without this, it is impossible to assess whether the reported gains are driven by the intended signal or by an artifact of the scoring procedure.
- [§5.1, Table 2] §5.1 and Table 2: The claim that Intuitor 'matches GRPO's performance' on math benchmarks is presented without error bars, number of random seeds, or statistical significance tests; the tables report point estimates only, so it is unclear whether the observed parity is robust or within the variance of the baseline.
- [§5.3] §5.3 (Generalization Experiments): No analysis is provided of the correlation between self-certainty and actual correctness on held-out incorrect reasoning traces; if high-certainty incorrect chains receive positive relative advantage, the policy update could reinforce overconfident errors rather than accuracy, which is a load-bearing assumption for the unsupervised regime (a toy illustration of this failure mode follows this list).
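A toy illustration of the failure mode flagged in the third comment, with made-up numbers: because GRPO standardizes rewards within each group, the most self-certain response receives the largest positive advantage whether or not it is correct.

```python
import numpy as np

# Hypothetical group of 4 responses sampled for one prompt. Response 3 is
# confidently wrong; the ground truth is shown only for the reader, Intuitor
# never sees it.
certainty = np.array([2.1, 2.3, 2.0, 3.4])  # self-certainty rewards
correct = np.array([1, 1, 1, 0])            # unused by the training loop

# GRPO-style group-relative advantage: standardize within the group.
advantage = (certainty - certainty.mean()) / (certainty.std() + 1e-6)
# The confidently wrong response (index 3) receives the largest positive
# advantage (~1.7), so the policy update pushes toward it.
print(advantage)
```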
minor comments (2)
- [Abstract] The abstract states 'without requiring gold solutions or test cases' but the method description should explicitly note whether any unlabeled data or prompt templates are still required for the unsupervised setting.
- [Figure 3] Figure 3 caption and axis labels use inconsistent terminology ('self-certainty' vs. 'certainty score'); standardize notation across text and figures.
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and insightful comments on our manuscript. We believe the suggested revisions will strengthen the paper by providing more transparency on the method and additional empirical support for the claims. We address each major comment below.
read point-by-point responses
- Referee: [§3.2] §3.2 (Self-Certainty Reward): The computation of the self-certainty score is described at a high level but lacks the precise formula or pseudocode needed to verify it is not simply a monotonic function of sequence length or token entropy; without this, it is impossible to assess whether the reported gains are driven by the intended signal or by an artifact of the scoring procedure.
  Authors: We agree that the description in §3.2 is high-level. In the revised manuscript, we will include the exact formula for the self-certainty score and pseudocode to allow verification that it is not an artifact of sequence length or token entropy. This addition will clarify how the self-certainty signal is computed from the model's internal probabilities. revision: yes
- Referee: [§5.1, Table 2] §5.1 and Table 2: The claim that Intuitor 'matches GRPO's performance' on math benchmarks is presented without error bars, number of random seeds, or statistical significance tests; the tables report point estimates only, so it is unclear whether the observed parity is robust or within the variance of the baseline.
  Authors: We acknowledge this limitation in the current presentation. We will revise §5.1 and Table 2 to include error bars from multiple random seeds and report results of statistical significance tests to confirm the robustness of the performance matching between Intuitor and GRPO. revision: yes
- Referee: [§5.3] §5.3 (Generalization Experiments): No analysis is provided of the correlation between self-certainty and actual correctness on held-out incorrect reasoning traces; if high-certainty incorrect chains receive positive relative advantage, the policy update could reinforce overconfident errors rather than accuracy, which is a load-bearing assumption for the unsupervised regime.
  Authors: We recognize the importance of this analysis for validating the unsupervised approach. In the revision, we will add to §5.3 a correlation analysis between self-certainty scores and correctness on held-out incorrect reasoning traces to address concerns about reinforcing overconfident errors. revision: yes
Circularity Check
No significant circularity; reward substitution is an explicit ansatz evaluated on independent benchmarks
full rationale
The paper defines Intuitor by direct substitution of GRPO's verifiable reward with the model's internal self-certainty score, then reports empirical results on standard math and code benchmarks. No equations or derivations reduce the performance claims to quantities defined by the method itself; the self-certainty signal is introduced as a modeling choice rather than derived from the target accuracy metric. Validation relies on external test sets that are not part of the training loop or fitted parameters, satisfying the criteria for a self-contained, non-circular evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A model's self-certainty score can serve as a reliable proxy for reasoning quality without external verification.
invented entities (1)
- Intuitor · no independent evidence
Forward citations
Cited by 19 Pith papers
- SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
  SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
  ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
  SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
- PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
  PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
- Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
  Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
  ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
  Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
- Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
  DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
- TEMPO: Scaling Test-time Training for Large Reasoning Models
  TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
  A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
  SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
- Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
  QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
  RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Your Model Diversity, Not Method, Determines Reasoning Strategy
  The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
discussion (0)