Learning to Reason without External Rewards
Pith reviewed 2026-05-15 21:13 UTC · model grok-4.3
The pith
Large language models can improve at reasoning by using only their own internal confidence as the reward signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using the model's self-certainty as the reward signal in place of external verification, Intuitor trains large language models on reasoning tasks in a fully unsupervised manner. On mathematical reasoning benchmarks the method matches the performance of Group Relative Policy Optimization trained with gold solutions, while generalizing better to out-of-domain tasks such as code generation.
What carries the argument
Intuitor, which substitutes the model's self-certainty score for external rewards inside the Group Relative Policy Optimization algorithm.
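A minimal sketch of that substitution, assuming self-certainty is scored as the mean KL divergence of the model's next-token distributions from the uniform distribution over the vocabulary (the exact formula is what the referee's first major comment asks for); the advantage computation is standard GRPO group-relative normalization:

```python
import numpy as np

def self_certainty(token_logits: np.ndarray) -> float:
    """Sequence-level self-certainty from per-token logits of shape [T, V].

    Assumed form: mean KL(uniform || p_t) over output positions t, so the
    score grows as the next-token distributions become more peaked.
    """
    V = token_logits.shape[-1]
    log_p = token_logits - np.logaddexp.reduce(token_logits, axis=-1, keepdims=True)
    kl_from_uniform = -np.log(V) - log_p.mean(axis=-1)  # KL(U || p_t) per position
    return float(kl_from_uniform.mean())

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: standardize rewards across the responses sampled for one prompt.

    Intuitor's only change is what goes into `rewards`: self-certainty scores
    replace verifier outcomes.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```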
If this is right
- Reasoning training becomes possible in domains where gold answers or test cases cannot be obtained.
- Models show stronger transfer to new task types than those trained with external reward signals.
- Autonomous improvement cycles can run without repeated collection of domain-specific verification data.
- The same internal signal supports learning across multiple problem areas without additional labeling effort.
Where Pith is reading between the lines
- Repeated application of the same loop could allow models to keep improving their own outputs over multiple rounds without outside input.
- The approach may extend to tasks outside language, such as planning or decision sequences, if an analogous internal certainty measure exists.
- Safeguards against overconfidence may become necessary to prevent the model from reinforcing its own mistakes at scale.
Load-bearing premise
The model's self-certainty score accurately reflects the correctness of its reasoning rather than rewarding fluent but wrong outputs.
What would settle it
A controlled test of whether, after Intuitor training, the model produces more incorrect answers on a new benchmark than before training while reporting higher self-certainty on those errors.
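A minimal sketch of such an audit, assuming per-example self-certainty scores and 0/1 correctness labels on a labeled held-out benchmark (the function and field names here are illustrative, not taken from the paper):

```python
import numpy as np

def certainty_vs_correctness(certainty: np.ndarray, correct: np.ndarray) -> dict:
    """Audit whether self-certainty tracks correctness after Intuitor training.

    certainty: per-example self-certainty scores on a labeled benchmark
    correct:   per-example 0/1 correctness labels

    The load-bearing premise predicts a positive correlation and accuracy
    rising across certainty quartiles; the refuting pattern is high
    certainty concentrated on wrong answers.
    """
    corr = float(np.corrcoef(certainty, correct)[0, 1])  # point-biserial correlation
    edges = np.quantile(certainty, [0.25, 0.5, 0.75])
    bucket = np.digitize(certainty, edges)               # 0 = lowest certainty quartile
    acc_by_quartile = [float(correct[bucket == q].mean()) for q in range(4)]
    return {"correlation": corr, "accuracy_by_certainty_quartile": acc_by_quartile}
```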
read the original abstract
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Intuitor, a Reinforcement Learning from Internal Feedback (RLIF) method that substitutes a model's self-certainty score for external verifiable rewards inside the Group Relative Policy Optimization (GRPO) algorithm. It claims this enables fully unsupervised training of LLMs on reasoning tasks, with experiments showing Intuitor matching GRPO performance on mathematical benchmarks while generalizing better to out-of-domain code generation, all without gold solutions or test cases.
Significance. If the central claim holds, the result is significant for the field because it provides evidence that intrinsic model signals can drive policy improvement in RLVR-style training without domain-specific supervision. The open-source code release supports reproducibility and allows direct inspection of the self-certainty implementation.
major comments (3)
- [§3.2] §3.2 (Self-Certainty Reward): The computation of the self-certainty score is described at a high level but lacks the precise formula or pseudocode needed to verify it is not simply a monotonic function of sequence length or token entropy; without this, it is impossible to assess whether the reported gains are driven by the intended signal or by an artifact of the scoring procedure.
- [§5.1, Table 2] §5.1 and Table 2: The claim that Intuitor 'matches GRPO's performance' on math benchmarks is presented without error bars, number of random seeds, or statistical significance tests; the tables report point estimates only, so it is unclear whether the observed parity is robust or within the variance of the baseline.
- [§5.3] §5.3 (Generalization Experiments): No analysis is provided of the correlation between self-certainty and actual correctness on held-out incorrect reasoning traces; if high-certainty incorrect chains receive positive relative advantage, the policy update could reinforce overconfident errors rather than accuracy, which is a load-bearing assumption for the unsupervised regime (a toy illustration of this failure mode follows this list).
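A toy illustration of the failure mode flagged in the third comment, with made-up numbers: because GRPO standardizes rewards within each group, the most self-certain response receives the largest positive advantage whether or not it is correct.

```python
import numpy as np

# Hypothetical group of 4 responses sampled for one prompt. Response 3 is
# confidently wrong; the ground truth is shown only for the reader, Intuitor
# never sees it.
certainty = np.array([2.1, 2.3, 2.0, 3.4])  # self-certainty rewards
correct = np.array([1, 1, 1, 0])            # unused by the training loop

# GRPO-style group-relative advantage: standardize within the group.
advantage = (certainty - certainty.mean()) / (certainty.std() + 1e-6)
# The confidently wrong response (index 3) receives the largest positive
# advantage (~1.7), so the policy update pushes toward it.
print(advantage)
```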
minor comments (2)
- [Abstract] The abstract states 'without requiring gold solutions or test cases' but the method description should explicitly note whether any unlabeled data or prompt templates are still required for the unsupervised setting.
- [Figure 3] Figure 3 caption and axis labels use inconsistent terminology ('self-certainty' vs. 'certainty score'); standardize notation across text and figures.
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and insightful comments on our manuscript. We believe the suggested revisions will strengthen the paper by providing more transparency on the method and additional empirical support for the claims. We address each major comment below.
read point-by-point responses
- Referee: [§3.2] §3.2 (Self-Certainty Reward): The computation of the self-certainty score is described at a high level but lacks the precise formula or pseudocode needed to verify it is not simply a monotonic function of sequence length or token entropy; without this, it is impossible to assess whether the reported gains are driven by the intended signal or by an artifact of the scoring procedure.
  Authors: We agree that the description in §3.2 is high-level. In the revised manuscript, we will include the exact formula for the self-certainty score and pseudocode to allow verification that it is not an artifact of sequence length or token entropy. This addition will clarify how the self-certainty signal is computed from the model's internal probabilities. revision: yes
- Referee: [§5.1, Table 2] §5.1 and Table 2: The claim that Intuitor 'matches GRPO's performance' on math benchmarks is presented without error bars, number of random seeds, or statistical significance tests; the tables report point estimates only, so it is unclear whether the observed parity is robust or within the variance of the baseline.
  Authors: We acknowledge this limitation in the current presentation. We will revise §5.1 and Table 2 to include error bars from multiple random seeds and report results of statistical significance tests to confirm the robustness of the performance matching between Intuitor and GRPO. revision: yes
- Referee: [§5.3] §5.3 (Generalization Experiments): No analysis is provided of the correlation between self-certainty and actual correctness on held-out incorrect reasoning traces; if high-certainty incorrect chains receive positive relative advantage, the policy update could reinforce overconfident errors rather than accuracy, which is a load-bearing assumption for the unsupervised regime.
  Authors: We recognize the importance of this analysis for validating the unsupervised approach. In the revision, we will add to §5.3 a correlation analysis between self-certainty scores and correctness on held-out incorrect reasoning traces to address concerns about reinforcing overconfident errors. revision: yes
Circularity Check
No significant circularity; reward substitution is an explicit ansatz evaluated on independent benchmarks
full rationale
The paper defines Intuitor by direct substitution of GRPO's verifiable reward with the model's internal self-certainty score, then reports empirical results on standard math and code benchmarks. No equations or derivations reduce the performance claims to quantities defined by the method itself; the self-certainty signal is introduced as a modeling choice rather than derived from the target accuracy metric. Validation relies on external test sets that are not part of the training loop or fitted parameters, satisfying the criteria for a self-contained, non-circular evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A model's self-certainty score can serve as a reliable proxy for reasoning quality without external verification.
invented entities (1)
- Intuitor · no independent evidence
Forward citations
Cited by 19 Pith papers
- SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
  SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
  ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
  SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
- PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
  PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
- Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
  Anti-Self-Distillation reverses self-distillation signals via PMI to fix overconfidence on structural tokens, matching GRPO baseline accuracy 2-10x faster with up to 11.5 point gains across 4B-30B models.
- Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
  Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
  ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
- Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
  Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
- Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
  DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
- TEMPO: Scaling Test-time Training for Large Reasoning Models
  TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
  A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
  HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
  SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
- Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books
  QA-guided reasoning via a separate model producing structured traces improves faithfulness, informativeness, and grounding in character description generation from books over long-context LLM baselines.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
  RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Your Model Diversity, Not Method, Determines Reasoning Strategy
  The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
discussion (0)