citation dossier

R-zero: Self-evolving reasoning llm from zero data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu · 2025 · arXiv 2508.05004

19Pith papers citing it

20reference links

cs.AItop field · 7 papers

UNVERDICTEDtop verdict bucket · 16 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 19 reviewed papers. Its strongest current cluster is cs.AI (7 papers). The largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

cs.AI · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

G-Zero: Self-Play for Open-Ended Generation from Zero Data

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop reasoning tasks.

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

cs.SE · 2026-04-27 · unverdicted · novelty 6.0

Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.

Scaling Self-Play with Self-Guidance

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instruction following.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and improving efficiency 2-3× over standard self-play.

ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

cs.SE · 2026-04-09 · unverdicted · novelty 6.0

ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10

citing papers explorer

Showing 19 of 19 citing papers.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations cs.CV · 2026-04-20 · unverdicted · none · ref 14
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key cs.AI · 2026-05-07 · unverdicted · none · ref 71 · 2 links
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 9
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
Seir\^enes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 10
Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.
G-Zero: Self-Play for Open-Ended Generation from Zero Data cs.LG · 2026-05-11 · unverdicted · none · ref 10
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
SEIF: Self-Evolving Reinforcement Learning for Instruction Following cs.CL · 2026-05-08 · conditional · none · ref 12
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs cs.AI · 2026-05-07 · unverdicted · none · ref 7
SPARK constructs unified knowledge graphs from multi-document scientific literature to ground self-play RL with asymmetric roles and verifiable rewards, outperforming flat-corpus baselines especially on longer-hop reasoning tasks.
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora cs.SE · 2026-04-27 · unverdicted · none · ref 16
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.
Scaling Self-Play with Self-Guidance cs.LG · 2026-04-22 · unverdicted · none · ref 11
SGS adds self-guidance to LLM self-play for Lean4 theorem proving, surpassing RL baselines and enabling a 7B model to outperform a 671B model after 200 rounds.
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text cs.CL · 2026-04-21 · unverdicted · none · ref 14
POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instruction following.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 49
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data cs.LG · 2026-04-20 · unverdicted · none · ref 4
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration cs.AI · 2026-04-20 · unverdicted · none · ref 31
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 43
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data cs.LG · 2026-04-15 · unverdicted · none · ref 11
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and improving efficiency 2-3× over standard self-play.
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision? cs.SE · 2026-04-09 · unverdicted · none · ref 12
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution cs.CL · 2026-04-03 · unverdicted · none · ref 15
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unreviewed · ref 62
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · unreviewed · ref 38

R-zero: Self-evolving reasoning llm from zero data

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer