Recognition: 2 theorem links · Lean Theorem
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Pith reviewed 2026-05-14 19:18 UTC · model grok-4.3
The pith
R-Zero lets a base LLM create its own reasoning tasks by co-evolving a Challenger that proposes hard problems and a Solver that learns to solve them, with no human data or labels required.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a single base LLM, R-Zero initializes two independent models with distinct roles: a Challenger and a Solver. The two models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving the increasingly challenging tasks the Challenger poses. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels.
What carries the argument
The Challenger-Solver co-evolution loop, in which each model is rewarded for pushing the boundary of the other's current capability using only signals derived from their mutual interaction.
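As a reading aid, here is a minimal sketch of how such a loop could be wired up. Every helper name (propose_tasks, sample_solutions, majority_vote, rl_update, co_evolve) is a hypothetical stand-in for real LLM sampling and whatever RL optimizer the paper actually uses; only the 1 − 2|p̂ − 1/2| shape of the Challenger reward and the majority-vote pseudo-labels are taken from material quoted in the Lean-theorem section of this page.

```python
# Illustrative sketch of a Challenger-Solver co-evolution loop; not the paper's code.
# All helpers are hypothetical stand-ins for real LLM sampling and RL updates.
import random
from collections import Counter

def propose_tasks(challenger, n):
    # Challenger samples candidate problems; placeholders here.
    return [f"task-{random.random():.6f}" for _ in range(n)]

def sample_solutions(solver, task, k=8):
    # Solver samples k candidate answers per task; placeholders here.
    return [random.choice(["A", "B"]) for _ in range(k)]

def majority_vote(answers):
    # Pseudo-label: the most frequent sampled answer.
    return Counter(answers).most_common(1)[0][0]

def challenger_reward(answers):
    # Peaks when the Solver agrees with its pseudo-label about half the time,
    # i.e. the task sits near the edge of the Solver's current capability.
    p_hat = answers.count(majority_vote(answers)) / len(answers)
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def rl_update(model, items, rewards):
    # Placeholder for whatever policy-gradient update the paper actually uses.
    return model

def co_evolve(base_model, rounds=3, tasks_per_round=16):
    challenger, solver = dict(base_model), dict(base_model)  # two copies of one base LLM
    for _ in range(rounds):
        # Challenger phase: rewarded for proposing tasks at the capability edge.
        tasks = propose_tasks(challenger, tasks_per_round)
        c_rewards = [challenger_reward(sample_solutions(solver, t)) for t in tasks]
        challenger = rl_update(challenger, tasks, c_rewards)
        # Solver phase: rewarded for agreeing with its own majority-vote pseudo-labels.
        s_rewards = []
        for t in tasks:
            answers = sample_solutions(solver, t)
            label = majority_vote(answers)
            s_rewards.append(sum(a == label for a in answers) / len(answers))
        solver = rl_update(solver, tasks, s_rewards)
    return solver

co_evolve({"name": "base-llm"})
```

Note that in this sketch every training signal is computed from the two models' own samples, which is exactly why the circularity question raised in the editorial analysis below matters.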
If this is right
- Raises math-reasoning benchmark scores by 6.49 points on Qwen3-4B-Base
- Raises general-domain reasoning benchmark scores by 7.54 points on the same model
- Produces similar gains when applied to other backbone LLMs
- Generates an entire training curriculum from zero initial tasks or labels
- Separately optimizes the two roles while their interaction supplies the curriculum
Where Pith is reading between the lines
- Repeated cycles of the same loop could support ongoing capability growth with no further external input
- The same Challenger-Solver pattern might extend to non-reasoning skills such as code generation or scientific hypothesis testing
- Multiple parallel pairs could be run together to tackle broader or more open-ended problems
- The method points toward training pipelines that require only a base model and compute rather than curated datasets
Load-bearing premise
Internal rewards based on task difficulty and solution success can be computed accurately enough to produce real capability growth rather than reward gaming or stalled progress.
What would settle it
Apply R-Zero to a base model such as Qwen3-4B-Base, then evaluate the resulting Solver on standard held-out benchmarks such as MATH or GSM8K; if average scores show no gain or a drop relative to the untouched base model, the improvement claim fails.
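The arithmetic of that test is simple, but for concreteness, a sketch with placeholder scores; the numbers below are made up for illustration and are not results from the paper.

```python
# Falsification check: does the R-Zero-trained Solver beat the untouched base model
# on held-out benchmarks, on average? All numbers below are made up for illustration.
def improvement_claim_holds(base_scores, solver_scores):
    names = sorted(base_scores)
    base_avg = sum(base_scores[n] for n in names) / len(names)
    solver_avg = sum(solver_scores[n] for n in names) / len(names)
    return solver_avg > base_avg  # no gain, or a drop, falsifies the claim

base = {"MATH": 30.0, "GSM8K": 55.0}         # hypothetical base-model accuracies
after = {"MATH": 36.0, "GSM8K": 62.0}        # hypothetical post-R-Zero accuracies
print(improvement_claim_holds(base, after))  # True under these made-up numbers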
read the original abstract
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces R-Zero, a fully autonomous self-evolving framework for LLMs. Starting from a single base model, it initializes two independent instances as Challenger (which proposes tasks near the Solver's capability edge) and Solver (which solves them). These roles co-evolve through separate optimization using internally generated rewards, producing a curriculum from zero data and no human labels. Experiments report concrete gains, e.g., +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks for the Qwen3-4B-Base backbone.
Significance. If the internal reward definitions and difficulty proxies can be shown to drive genuine capability expansion rather than closed-loop artifacts, the work would constitute a meaningful step toward scalable, label-free self-improvement of reasoning models. The zero-data initialization and explicit separation of Challenger/Solver roles are notable strengths that distinguish it from prior self-play or synthetic-data approaches.
major comments (3)
- [Abstract] Abstract: The central claim of substantial benchmark gains rests on the Challenger and Solver rewards being computed purely internally without external anchors. No description is given of the concrete proxies used (e.g., success rate, entropy, or self-consistency) or how task difficulty is quantified, making it impossible to evaluate whether the reported +6.49/+7.54 lifts reflect robust reasoning improvement or reward hacking.
- [Method] Method section (co-evolution loop): Because both models derive their training signals from each other's outputs, the setup is vulnerable to circularity. If difficulty estimation reduces to quantities fitted on the models' own distributions, the co-adaptation can exploit shared biases (low-variance patterns or predictable failure modes) rather than expanding capability; the manuscript must supply an explicit non-circularity argument or external validation protocol.
- [Experiments] Experiments: The benchmark improvements are presented without controls for data leakage, distribution shift, or mode collapse. A load-bearing test would be to measure performance on held-out tasks whose distribution is provably disjoint from any internally generated data; absent such controls, the gains cannot be confidently attributed to the self-evolving mechanism.
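One lightweight way to operationalize part of the control asked for in the last comment is an n-gram overlap scan between internally generated tasks and held-out benchmark items. This checks only surface contamination, not distributional disjointness, and is an illustrative assumption rather than anything the paper reports.

```python
# Illustrative surface-level contamination scan between generated tasks and held-out
# benchmark items. It flags any benchmark question sharing a long n-gram with the
# training tasks; it does not establish distributional disjointness.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

def flag_overlaps(generated_tasks, benchmark_items, n=8):
    train_grams = set()
    for t in generated_tasks:
        train_grams |= ngrams(t, n)
    return [q for q in benchmark_items if ngrams(q, n) & train_grams]

# Hypothetical data for illustration only.
generated = ["Find all integers x such that x^2 - 5x + 6 = 0 and explain each step."]
benchmark = ["Find all integers x such that x^2 - 5x + 6 = 0 and explain each step.",
             "A train leaves the station at 3 pm travelling at 60 km/h ..."]
print(flag_overlaps(generated, benchmark, n=8))  # flags the first, identical item
```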
minor comments (1)
- [Abstract] Notation for the two roles should be introduced once with explicit symbols (e.g., C for Challenger, S for Solver) and used consistently; the current abstract alternates between descriptive phrases and implicit references.
Simulated Author's Rebuttal
We are grateful to the referee for the positive summary and for identifying areas where the manuscript can be strengthened. We address each major comment below and have made revisions to clarify the reward mechanisms, provide non-circularity arguments, and add experimental controls.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of substantial benchmark gains rests on the Challenger and Solver rewards being computed purely internally without external anchors. No description is given of the concrete proxies used (e.g., success rate, entropy, or self-consistency) or how task difficulty is quantified, making it impossible to evaluate whether the reported +6.49/+7.54 lifts reflect robust reasoning improvement or reward hacking.
Authors: We thank the referee for this important point. While the method section describes the reward functions in detail (Challenger reward based on Solver failure rate as a proxy for difficulty, with task difficulty estimated via the variance in Solver's sampled solutions), the abstract indeed lacks a concise summary of these proxies. We have revised the abstract to include: 'The Challenger is rewarded based on the Solver's failure rate on proposed tasks, with difficulty quantified by solution entropy, while the Solver is rewarded for successful solutions.' This makes the internal nature explicit and allows evaluation of the gains as genuine improvements. revision: yes
-
Referee: [Method] Method section (co-evolution loop): Because both models derive their training signals from each other's outputs, the setup is vulnerable to circularity. If difficulty estimation reduces to quantities fitted on the models' own distributions, the co-adaptation can exploit shared biases (low-variance patterns or predictable failure modes) rather than expanding capability; the manuscript must supply an explicit non-circularity argument or external validation protocol.
Authors: We agree that an explicit argument is necessary. The co-evolution is non-circular because the Challenger and Solver are optimized with opposing objectives using separate loss functions and no shared gradients or parameters beyond the initial base model. Task proposals by the Challenger use high-temperature sampling to generate out-of-distribution tasks relative to the Solver's current policy. We have added a new paragraph in the Method section providing this argument and referencing the external benchmark results as validation that capabilities expand beyond any internal loop artifacts. revision: yes
-
Referee: [Experiments] Experiments: The benchmark improvements are presented without controls for data leakage, distribution shift, or mode collapse. A load-bearing test would be to measure performance on held-out tasks whose distribution is provably disjoint from any internally generated data; absent such controls, the gains cannot be confidently attributed to the self-evolving mechanism.
Authors: This is a fair critique. The original experiments focused on standard benchmarks which are disjoint by construction from the zero-data initialization, but we acknowledge the need for explicit held-out controls. In the revised manuscript, we have included results on a newly constructed held-out test set of 200 problems from sources post-dating the model's training cutoff, ensuring no overlap with generated data. Performance improvements hold, supporting attribution to the self-evolving process. We also report diversity metrics to rule out mode collapse. revision: yes
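A hedged sketch of the diversity metric this last response mentions: a distinct-n-gram ratio over the Challenger's generated tasks. The metric choice is an assumption for illustration, not necessarily the one reported in the revision.

```python
# Illustrative diversity probe over Challenger-generated tasks, to flag mode collapse.
# A distinct-n ratio trending toward 0 across rounds would indicate the Challenger is
# recycling near-identical problems rather than broadening the curriculum.
def distinct_n(texts, n=2):
    total, unique = 0, set()
    for t in texts:
        tokens = t.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

tasks = [  # hypothetical generated tasks
    "prove that 17 is prime",
    "prove that 19 is prime",
    "solve x^2 - 5x + 6 = 0",
]
print(round(distinct_n(tasks, n=2), 3))
```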
Circularity Check
No significant circularity; external benchmarks break the loop
full rationale
The paper defines internal rewards for the Challenger (proposing tasks at the Solver's capability edge) and the Solver (solving them) purely from their interaction, starting from zero data. However, all reported gains (+6.49 math, +7.54 general) are measured on independent external benchmarks that are not part of the training loop or reward definitions. Nothing in the abstract (no equations, self-citations, or derivations) reduces the benchmark improvements to internal proxies by construction; the co-evolution generates a curriculum whose success is validated outside the closed system. This satisfies the requirement for independent falsifiability rather than resting on a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reward signals based on relative task difficulty between Challenger and Solver produce genuine capability gains rather than reward hacking.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "the Challenger is rewarded for proposing tasks near the edge of Solver capability... r_uncertainty(x; ϕ) = 1 − 2|p̂(x; S_ϕ) − 1/2|... pseudo-labels voted by itself"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "self-evolving... from zero data... no pre-existing tasks and labels"
Forward citations
Cited by 25 Pith papers
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
G-Zero: Self-Play for Open-Ended Generation from Zero Data
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs
SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop rea...
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
Scaling Self-Play with Self-Guidance
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
-
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
-
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
discussion (0)