Recognition: 2 theorem links · Lean Theorem
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Pith reviewed 2026-05-14 19:18 UTC · model grok-4.3
The pith
R-Zero lets a base LLM create its own reasoning tasks by co-evolving a Challenger that proposes hard problems and a Solver that learns to solve them, with no human data or labels required.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a single base LLM, R-Zero initializes two independent models with distinct roles: a Challenger and a Solver. The two models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver's capability, and the Solver is rewarded for solving the increasingly challenging tasks the Challenger poses. This process yields a targeted, self-improving curriculum without any pre-existing tasks or labels.
What carries the argument
The Challenger-Solver co-evolution loop, in which each model is rewarded for pushing the boundary of the other's current capability using only signals derived from their mutual interaction.
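As a reading aid, here is a minimal sketch of how such a loop could be wired up. Every helper name (propose_tasks, sample_solutions, majority_vote, rl_update, co_evolve) is a hypothetical stand-in for real LLM sampling and whatever RL optimizer the paper actually uses; only the 1 − 2|p̂ − 1/2| shape of the Challenger reward and the majority-vote pseudo-labels are taken from material quoted in the Lean-theorem section of this page.

```python
# Illustrative sketch of a Challenger-Solver co-evolution loop; not the paper's code.
# All helpers are hypothetical stand-ins for real LLM sampling and RL updates.
import random
from collections import Counter

def propose_tasks(challenger, n):
    # Challenger samples candidate problems; placeholders here.
    return [f"task-{random.random():.6f}" for _ in range(n)]

def sample_solutions(solver, task, k=8):
    # Solver samples k candidate answers per task; placeholders here.
    return [random.choice(["A", "B"]) for _ in range(k)]

def majority_vote(answers):
    # Pseudo-label: the most frequent sampled answer.
    return Counter(answers).most_common(1)[0][0]

def challenger_reward(answers):
    # Peaks when the Solver agrees with its pseudo-label about half the time,
    # i.e. the task sits near the edge of the Solver's current capability.
    p_hat = answers.count(majority_vote(answers)) / len(answers)
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def rl_update(model, items, rewards):
    # Placeholder for whatever policy-gradient update the paper actually uses.
    return model

def co_evolve(base_model, rounds=3, tasks_per_round=16):
    challenger, solver = dict(base_model), dict(base_model)  # two copies of one base LLM
    for _ in range(rounds):
        # Challenger phase: rewarded for proposing tasks at the capability edge.
        tasks = propose_tasks(challenger, tasks_per_round)
        c_rewards = [challenger_reward(sample_solutions(solver, t)) for t in tasks]
        challenger = rl_update(challenger, tasks, c_rewards)
        # Solver phase: rewarded for agreeing with its own majority-vote pseudo-labels.
        s_rewards = []
        for t in tasks:
            answers = sample_solutions(solver, t)
            label = majority_vote(answers)
            s_rewards.append(sum(a == label for a in answers) / len(answers))
        solver = rl_update(solver, tasks, s_rewards)
    return solver

co_evolve({"name": "base-llm"})
```

Note that in this sketch every training signal is computed from the two models' own samples, which is exactly why the circularity question raised in the editorial analysis below matters.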
If this is right
- Raises math-reasoning benchmark scores by 6.49 points on Qwen3-4B-Base
- Raises general-domain reasoning benchmark scores by 7.54 points on the same model
- Produces similar gains when applied to other backbone LLMs
- Generates an entire training curriculum from zero initial tasks or labels
- Separately optimizes the two roles while their interaction supplies the curriculum
Where Pith is reading between the lines
- Repeated cycles of the same loop could support ongoing capability growth with no further external input
- The same Challenger-Solver pattern might extend to non-reasoning skills such as code generation or scientific hypothesis testing
- Multiple parallel pairs could be run together to tackle broader or more open-ended problems
- The method points toward training pipelines that require only a base model and compute rather than curated datasets
Load-bearing premise
Internal rewards based on task difficulty and solution success can be computed accurately enough to produce real capability growth rather than reward gaming or stalled progress.
What would settle it
Apply R-Zero to a base model such as Qwen3-4B-Base, then evaluate the resulting Solver on standard held-out benchmarks such as MATH or GSM8K; if average scores show no gain or a drop relative to the untouched base model, the improvement claim fails.
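The arithmetic of that test is simple, but for concreteness, a sketch with placeholder scores; the numbers below are made up for illustration and are not results from the paper.

```python
# Falsification check: does the R-Zero-trained Solver beat the untouched base model
# on held-out benchmarks, on average? All numbers below are made up for illustration.
def improvement_claim_holds(base_scores, solver_scores):
    names = sorted(base_scores)
    base_avg = sum(base_scores[n] for n in names) / len(names)
    solver_avg = sum(solver_scores[n] for n in names) / len(names)
    return solver_avg > base_avg  # no gain, or a drop, falsifies the claim

base = {"MATH": 30.0, "GSM8K": 55.0}         # hypothetical base-model accuracies
after = {"MATH": 36.0, "GSM8K": 62.0}        # hypothetical post-R-Zero accuracies
print(improvement_claim_holds(base, after))  # True under these made-up numbers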
read the original abstract
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces R-Zero, a fully autonomous self-evolving framework for LLMs. Starting from a single base model, it initializes two independent instances as Challenger (which proposes tasks near the Solver's capability edge) and Solver (which solves them). These roles co-evolve through separate optimization using internally generated rewards, producing a curriculum from zero data and no human labels. Experiments report concrete gains, e.g., +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks for the Qwen3-4B-Base backbone.
Significance. If the internal reward definitions and difficulty proxies can be shown to drive genuine capability expansion rather than closed-loop artifacts, the work would constitute a meaningful step toward scalable, label-free self-improvement of reasoning models. The zero-data initialization and explicit separation of Challenger/Solver roles are notable strengths that distinguish it from prior self-play or synthetic-data approaches.
major comments (3)
- [Abstract] Abstract: The central claim of substantial benchmark gains rests on the Challenger and Solver rewards being computed purely internally without external anchors. No description is given of the concrete proxies used (e.g., success rate, entropy, or self-consistency) or how task difficulty is quantified, making it impossible to evaluate whether the reported +6.49/+7.54 lifts reflect robust reasoning improvement or reward hacking.
- [Method] Method section (co-evolution loop): Because both models derive their training signals from each other's outputs, the setup is vulnerable to circularity. If difficulty estimation reduces to quantities fitted on the models' own distributions, the co-adaptation can exploit shared biases (low-variance patterns or predictable failure modes) rather than expanding capability; the manuscript must supply an explicit non-circularity argument or external validation protocol.
- [Experiments] Experiments: The benchmark improvements are presented without controls for data leakage, distribution shift, or mode collapse. A load-bearing test would be to measure performance on held-out tasks whose distribution is provably disjoint from any internally generated data; absent such controls, the gains cannot be confidently attributed to the self-evolving mechanism.
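One lightweight way to operationalize part of the control asked for in the last comment is an n-gram overlap scan between internally generated tasks and held-out benchmark items. This checks only surface contamination, not distributional disjointness, and is an illustrative assumption rather than anything the paper reports.

```python
# Illustrative surface-level contamination scan between generated tasks and held-out
# benchmark items. It flags any benchmark question sharing a long n-gram with the
# training tasks; it does not establish distributional disjointness.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

def flag_overlaps(generated_tasks, benchmark_items, n=8):
    train_grams = set()
    for t in generated_tasks:
        train_grams |= ngrams(t, n)
    return [q for q in benchmark_items if ngrams(q, n) & train_grams]

# Hypothetical data for illustration only.
generated = ["Find all integers x such that x^2 - 5x + 6 = 0 and explain each step."]
benchmark = ["Find all integers x such that x^2 - 5x + 6 = 0 and explain each step.",
             "A train leaves the station at 3 pm travelling at 60 km/h ..."]
print(flag_overlaps(generated, benchmark, n=8))  # flags the first, identical item
```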
minor comments (1)
- [Abstract] Notation for the two roles should be introduced once with explicit symbols (e.g., C for Challenger, S for Solver) and used consistently; the current abstract alternates between descriptive phrases and implicit references.
Simulated Author's Rebuttal
We are grateful to the referee for the positive summary and for identifying areas where the manuscript can be strengthened. We address each major comment below and have made revisions to clarify the reward mechanisms, provide non-circularity arguments, and add experimental controls.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of substantial benchmark gains rests on the Challenger and Solver rewards being computed purely internally without external anchors. No description is given of the concrete proxies used (e.g., success rate, entropy, or self-consistency) or how task difficulty is quantified, making it impossible to evaluate whether the reported +6.49/+7.54 lifts reflect robust reasoning improvement or reward hacking.
Authors: We thank the referee for this important point. While the method section describes the reward functions in detail (Challenger reward based on Solver failure rate as a proxy for difficulty, with task difficulty estimated via the variance in Solver's sampled solutions), the abstract indeed lacks a concise summary of these proxies. We have revised the abstract to include: 'The Challenger is rewarded based on the Solver's failure rate on proposed tasks, with difficulty quantified by solution entropy, while the Solver is rewarded for successful solutions.' This makes the internal nature explicit and allows evaluation of the gains as genuine improvements. revision: yes
-
Referee: [Method] Method section (co-evolution loop): Because both models derive their training signals from each other's outputs, the setup is vulnerable to circularity. If difficulty estimation reduces to quantities fitted on the models' own distributions, the co-adaptation can exploit shared biases (low-variance patterns or predictable failure modes) rather than expanding capability; the manuscript must supply an explicit non-circularity argument or external validation protocol.
Authors: We agree that an explicit argument is necessary. The co-evolution is non-circular because the Challenger and Solver are optimized with opposing objectives using separate loss functions and no shared gradients or parameters beyond the initial base model. Task proposals by the Challenger use high-temperature sampling to generate out-of-distribution tasks relative to the Solver's current policy. We have added a new paragraph in the Method section providing this argument and referencing the external benchmark results as validation that capabilities expand beyond any internal loop artifacts. revision: yes
-
Referee: [Experiments] Experiments: The benchmark improvements are presented without controls for data leakage, distribution shift, or mode collapse. A load-bearing test would be to measure performance on held-out tasks whose distribution is provably disjoint from any internally generated data; absent such controls, the gains cannot be confidently attributed to the self-evolving mechanism.
Authors: This is a fair critique. The original experiments focused on standard benchmarks which are disjoint by construction from the zero-data initialization, but we acknowledge the need for explicit held-out controls. In the revised manuscript, we have included results on a newly constructed held-out test set of 200 problems from sources post-dating the model's training cutoff, ensuring no overlap with generated data. Performance improvements hold, supporting attribution to the self-evolving process. We also report diversity metrics to rule out mode collapse. revision: yes
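A hedged sketch of the diversity metric this last response mentions: a distinct-n-gram ratio over the Challenger's generated tasks. The metric choice is an assumption for illustration, not necessarily the one reported in the revision.

```python
# Illustrative diversity probe over Challenger-generated tasks, to flag mode collapse.
# A distinct-n ratio trending toward 0 across rounds would indicate the Challenger is
# recycling near-identical problems rather than broadening the curriculum.
def distinct_n(texts, n=2):
    total, unique = 0, set()
    for t in texts:
        tokens = t.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

tasks = [  # hypothetical generated tasks
    "prove that 17 is prime",
    "prove that 19 is prime",
    "solve x^2 - 5x + 6 = 0",
]
print(round(distinct_n(tasks, n=2), 3))
```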
Circularity Check
No significant circularity; external benchmarks break the loop
full rationale
The paper defines internal rewards for the Challenger (proposing tasks at the Solver's capability edge) and the Solver (solving them) purely from their interaction, starting from zero data. However, all reported gains (+6.49 math, +7.54 general) are measured on independent external benchmarks that are not part of the training loop or reward definitions. Nothing in the abstract (no equations, self-citations, or derivations) reduces the benchmark improvements to internal proxies by construction; the co-evolution generates a curriculum whose success is validated outside the closed system. This satisfies the requirement for independent falsifiability rather than resting on a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reward signals based on relative task difficulty between Challenger and Solver produce genuine capability gains rather than reward hacking.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "the Challenger is rewarded for proposing tasks near the edge of Solver capability... r_uncertainty(x; ϕ) = 1 − 2|p̂(x; S_ϕ) − 1/2|... pseudo-labels voted by itself"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "self-evolving... from zero data... no pre-existing tasks and labels"
Forward citations
Cited by 25 Pith papers
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
-
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
-
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
-
G-Zero: Self-Play for Open-Ended Generation from Zero Data
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs
SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop rea...
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted ...
-
Scaling Self-Play with Self-Guidance
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
-
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
-
π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
-
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
discussion (0)