Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Dian Yu; Haitao Mi; Kishan Panaganti; Wenhao Yu; Yucheng Shi; Zhenwen Liang

arxiv: 2605.14392 · v1 · pith:2FL56AAQnew · submitted 2026-05-14 · 💻 cs.AI

Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi

Pith reviewed 2026-06-30 20:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords: self-improving language models, reasoning RL, environment synthesis, solve-verify asymmetry, verifiable environments, zero-data RL, EvoEnv, Python environments

The pith

Self-improving reasoning RL succeeds when models synthesize their own environments that maintain stable solve-verify asymmetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that zero-data reasoning reinforcement learning improves when models shift from generating problems or traces to constructing reusable executable environments that sample instances, compute references, and score responses. These environments must preserve a durable gap where the model can write a verifier it cannot reliably execute on fresh instances in natural language. Fixed public data or hand-crafted environments lose their signal and can reduce performance as the model strengthens, while the proposed method avoids this by enforcing the asymmetry through algorithmic hardness or intrinsic verification ease. EvoEnv realizes the loop with a single policy that generates Python environments from seeds and filters them via staged validation, self-review, difficulty calibration, and novelty checks. On Qwen3-4B-Thinking the approach raises average performance from 72.4 to 74.8 while the static baselines lower it.
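To make the "reusable executable object" concrete, here is a minimal sketch of what one such environment might look like, assuming a three-method interface (`sample`, `reference`, `score`). The class name `LISEnvironment`, the `Instance` container, and the choice of task are illustrative assumptions, not taken from the paper. The task (longest strictly increasing subsequence) stands in for the "algorithmically hard to reason through but trivial as code" family.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    prompt: str
    values: list


class LISEnvironment:
    """Hypothetical environment: longest strictly increasing subsequence."""

    def __init__(self, length=30, max_value=100):
        self.length = length
        self.max_value = max_value

    def sample(self, rng):
        # Draw a fresh instance; every call yields a new problem.
        values = [rng.randint(1, self.max_value) for _ in range(self.length)]
        prompt = (
            "Give the length of the longest strictly increasing "
            f"subsequence of {values}."
        )
        return Instance(prompt, values)

    def reference(self, instance):
        # Quadratic dynamic program: best[i] is the length of the longest
        # strictly increasing subsequence that ends at position i.
        values = instance.values
        best = [1] * len(values)
        for i in range(len(values)):
            for j in range(i):
                if values[j] < values[i]:
                    best[i] = max(best[i], best[j] + 1)
        return max(best, default=0)

    def score(self, instance, response):
        # Binary reward: 1.0 only if the response parses to the reference.
        try:
            return 1.0 if int(response.strip()) == self.reference(instance) else 0.0
        except ValueError:
            return 0.0
```

The oracle is written once, yet `sample` can be called indefinitely, which is the property the summary above relies on.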

Core claim

The central claim is that stable self-improvement arises from an environment-construction loop in which each artifact is a reusable executable object that samples instances, computes references, and scores responses, provided the environments exhibit stable solve-verify asymmetry. This asymmetry takes two forms: tasks algorithmically hard to reason through but trivial as code, or tasks intrinsically hard to solve but easy to verify. Both keep the reward informative because the policy cannot close the gap by gaming the verifier. EvoEnv implements the loop by synthesizing Python environments from ten seeds and admitting them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks.

What carries the argument

Stable solve-verify asymmetry: the durable gap in which the model can write an oracle once that it cannot reliably execute in natural language on fresh instances, keeping the reward signal informative as the solver improves.

If this is right

On already strong models, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce average performance while environment synthesis raises it.
Self-improvement requires environments whose difficulty stays structurally beyond the model's reach rather than more synthetic data.
A single policy can serve as both generator and solver when environments are admitted only after validation and calibration.
Two complementary environment types sustain the asymmetry: algorithmically hard but trivial as code, and intrinsically hard to solve but easy to verify.
The reward signal stays useful only while the generator continues to produce environments that the current solver cannot reliably handle in natural language.
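The second environment type, intrinsically hard to solve but easy to verify, can be illustrated with planted subset-sum, which the abstract names as an example. The sketch below is an assumption about how such an environment could be written; the function names and parameters are hypothetical and do not come from the paper.

```python
import random


def sample_planted_subset_sum(rng, n=12, max_value=10**6):
    """Draw n numbers and plant a non-empty subset whose sum is the target."""
    numbers = [rng.randint(1, max_value) for _ in range(n)]
    k = rng.randint(1, n)
    planted = rng.sample(range(n), k)
    target = sum(numbers[i] for i in planted)
    # The planted indices guarantee that at least one solution exists.
    return numbers, target, planted


def verify_subset(numbers, target, indices):
    """Accept any non-empty set of distinct, in-range indices summing to target."""
    if not indices or len(set(indices)) != len(indices):
        return False
    if any(not (0 <= i < len(numbers)) for i in indices):
        return False
    return sum(numbers[i] for i in indices) == target
```

Verification is a single pass over the proposed indices, while finding a valid subset without the planted answer is a search problem. The verifier accepts any correct subset, not only the planted one.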

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same asymmetry principle could be tested in non-reasoning domains such as code generation if analogous oracle-solver gaps can be engineered.
Running the loop for more iterations would show whether the generator itself improves at creating harder environments or whether asymmetry eventually saturates.
Applying the method to larger base models might produce larger absolute gains if the generator scales in its ability to maintain the gap.
The approach suggests that future self-improvement systems should prioritize verifiable environment construction over pure data synthesis.

Load-bearing premise

The environments the generator produces will keep showing stable solve-verify asymmetry on new instances even after the solver policy strengthens, so the reward remains informative instead of being gamed or saturated.

What would settle it

After multiple training rounds with EvoEnv, measure whether the solver policy now solves the generated environments reliably in natural language without needing the code oracle; if it does, the asymmetry has collapsed and further improvement should cease.
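One way to phrase this test as code is sketched below, assuming an environment exposes `sample` and `score` callables and the solver is a function from instance to response. The helper names and the 0.9 threshold are illustrative assumptions; the paper does not specify a collapse criterion.

```python
import random


def solve_rate(sample, score, solver, n_instances=200, seed=0):
    """Average reward of the solver on freshly sampled instances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_instances):
        instance = sample(rng)
        total += score(instance, solver(instance))
    return total / n_instances


def asymmetry_collapsed(rate, threshold=0.9):
    # Assumed criterion: if the solver answers nearly every fresh instance,
    # the environment no longer provides an informative reward.
    return rate >= threshold
```

Running `solve_rate` before and after training on the same environment, with fresh seeds, would give the before/after comparison described above.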

Original abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry, that is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
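The quoted 3.3% is a relative gain over the 72.4 starting average, not a gain in percentage points. A short check of the arithmetic, using only the two numbers stated in the abstract (the helper name `relative_gain` is an illustrative choice):

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before


gain = relative_gain(72.4, 74.8)
print(f"absolute: {74.8 - 72.4:.1f} points, relative: {gain:.1%}")
# prints "absolute: 2.4 points, relative: 3.3%"
```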

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvoEnv shows a modest gain on Qwen3-4B-Thinking by synthesizing environments with staged validation for solve-verify asymmetry, but the abstract leaves the durability of that asymmetry after training untested.

The paper's core claim is that self-improvement in reasoning RL works better when the model learns to construct verifiable environments rather than just generating data or traces. On Qwen3-4B-Thinking, EvoEnv raises average performance from 72.4 to 74.8 while fixed public-data RLVR and fixed hand-crafted environment RLVR both lower it. That contrast is the clearest empirical point.

What is new is the single-policy generator-solver setup combined with explicit admission rules: staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The framing around stable solve-verify asymmetry (hard to carry out in natural language, easy to write or verify as code) is laid out directly as the condition needed to keep rewards informative.

The result on an already strong model is useful because it suggests the method sidesteps saturation that hits fixed setups. The baselines provide a reasonable control.

The soft spot is the missing check on whether asymmetry persists. Environments are admitted only after calibration against the current solver, so the reported gain could partly reflect environments the model already handles well. The abstract gives no post-training measurement on fresh instances to confirm the gap stays open or that the verifier does not saturate. Without that, the self-evolution loop is not shown to be durable.

This is for researchers focused on RL post-training for reasoning models. The idea and the reported delta are enough to send it to peer review so the full methods, run counts, and any later asymmetry tests can be examined.

Referee Report

2 major / 1 minor

Summary. The paper proposes EvoEnv, a single-policy generator-solver framework in which an LLM synthesizes reusable Python environments from ten seeds; environments are admitted only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The central claim is that these environments maintain stable solve-verify asymmetry (algorithmically hard to reason through but trivial as code, or intrinsically hard to solve but easy to verify), enabling durable self-improving reasoning RL. On Qwen3-4B-Thinking this yields an average performance increase from 72.4 to 74.8 (3.3% relative), while fixed public-data RLVR and fixed hand-crafted environment RLVR both decrease performance.
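The four admission checks named in the summary can be sketched as a chain of filters. Everything below is an assumption made for illustration: the `Candidate` fields, the pass-rate band of 0.2 to 0.8, and the word-overlap (Jaccard) novelty test are stand-ins, since the abstract does not say how EvoEnv implements each stage. Semantic self-review is passed in as a callable because in the paper it is performed by the policy itself.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    description: str
    sample: Callable     # rng -> instance
    reference: Callable  # instance -> answer
    score: Callable      # (instance, response) -> reward in [0, 1]


def passes_validation(c, rng, trials=5):
    # The oracle's own answer must earn full reward, and nothing may raise.
    try:
        for _ in range(trials):
            inst = c.sample(rng)
            if c.score(inst, c.reference(inst)) != 1.0:
                return False
        return True
    except Exception:
        return False


def in_difficulty_band(c, solver, rng, trials=50, low=0.2, high=0.8):
    # Solver-relative calibration: keep environments the current solver
    # sometimes solves, rejecting those that are too easy or too hard.
    hits = 0.0
    for _ in range(trials):
        inst = c.sample(rng)
        hits += c.score(inst, solver(inst))
    return low <= hits / trials <= high


def is_novel(c, admitted, max_jaccard=0.6):
    # Crude novelty proxy: word overlap between task descriptions.
    words = set(c.description.lower().split())
    for other in admitted:
        other_words = set(other.description.lower().split())
        union = words | other_words
        if union and len(words & other_words) / len(union) > max_jaccard:
            return False
    return True


def admit(c, admitted, solver, self_review, seed=0):
    rng = random.Random(seed)
    return (
        passes_validation(c, rng)
        and self_review(c)
        and in_difficulty_band(c, solver, rng)
        and is_novel(c, admitted)
    )
```

Because the difficulty band is measured against the current solver, the set of admitted environments depends on the solver being trained, which is the selection effect the major comments ask the authors to control for.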

Significance. If the durability of the solve-verify gap is demonstrated, the work offers a principled alternative to data-imitation loops for self-improvement, shifting focus to construction of executable environments whose difficulty remains structurally beyond the solver's reach. The distinction between algorithmic and intrinsic asymmetry forms is a clear conceptual contribution.

major comments (2)

[Abstract] Abstract: the reported gain (72.4 to 74.8) is presented without any mention of number of runs, standard deviation, statistical significance, or controls for selection bias introduced by the solver-relative difficulty calibration step; this makes it impossible to judge whether the 3.3% improvement is robust or partly an artifact of the admission filter.
[Abstract] Abstract (and implied results): no post-training measurement is described that checks whether the final solver still exhibits a stable solve-verify gap on newly sampled environments; without this, the claim that the asymmetry 'stays structurally beyond their own reach' remains unverified and the self-evolution loop could saturate after the first iteration.
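The statistics requested in the first major comment could be reported with something as simple as the sketch below. It assumes per-run average scores are available for a baseline and for the treated model; the function name and the choice of a Welch t statistic are illustrative, and no run-level numbers from the paper are used here.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(baseline, treated):
    """Mean, sample standard deviation, and Welch t statistic for two run sets."""
    m_b, m_t = mean(baseline), mean(treated)
    s_b, s_t = stdev(baseline), stdev(treated)  # needs at least 2 runs each
    # Standard error of the difference in means, without assuming equal variances.
    se = sqrt(s_b ** 2 / len(baseline) + s_t ** 2 / len(treated))
    t_stat = (m_t - m_b) / se if se > 0 else float("nan")
    return {
        "baseline_mean": m_b,
        "baseline_std": s_b,
        "treated_mean": m_t,
        "treated_std": s_t,
        "welch_t": t_stat,
    }
```

Called with two lists of per-seed averages, this returns the quantities the referee asks to see alongside the headline numbers.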

minor comments (1)

[Abstract] The abstract refers to 'ten seeds' and 'staged validation' but does not list the concrete benchmarks or task families used to compute the reported average; adding this would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater statistical transparency in the abstract and for a direct verification of persistent solve-verify asymmetry. Both comments identify genuine gaps in the current presentation. We address each below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: the reported gain (72.4 to 74.8) is presented without any mention of number of runs, standard deviation, statistical significance, or controls for selection bias introduced by the solver-relative difficulty calibration step; this makes it impossible to judge whether the 3.3% improvement is robust or partly an artifact of the admission filter.

Authors: We agree that the abstract lacks the requested statistical details and that this omission limits assessment of robustness. The solver-relative calibration is an integral part of the admission process, and while the fixed-baseline comparisons provide some control, explicit reporting is needed. In the revised manuscript we will update the abstract to state the number of independent runs, report standard deviation, note statistical significance where applicable, and briefly describe how selection bias is mitigated by the overall experimental design. revision: yes
Referee: [Abstract] Abstract (and implied results): no post-training measurement is described that checks whether the final solver still exhibits a stable solve-verify gap on newly sampled environments; without this, the claim that the asymmetry 'stays structurally beyond their own reach' remains unverified and the self-evolution loop could saturate after the first iteration.

Authors: The referee correctly notes that the manuscript does not include an explicit post-training evaluation of the solve-verify gap on newly generated environments. The observed performance lift provides supporting evidence, yet a direct measurement would more rigorously substantiate durability. We will add this analysis in the revision by evaluating the trained solver on a fresh set of environments sampled after the final iteration and reporting the resulting solve-verify gap. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claim is an empirical result: EvoEnv training yields a 72.4 to 74.8 gain on Qwen3-4B-Thinking while fixed baselines degrade performance. The derivation chain consists of an environment-generation procedure (ten seeds, staged validation, semantic self-review, solver-relative difficulty calibration, novelty checks) followed by RL training and external evaluation. No equations, fitted parameters renamed as predictions, or self-citations appear in the supplied text. The calibration step selects training environments relative to the current solver, but the reported metric is performance on (presumably held-out) benchmarks, not a quantity forced by the admission filter itself. The stability of solve-verify asymmetry is presented as a necessary assumption rather than a result derived from the inputs by construction. This is therefore a standard empirical claim whose validity can be checked against external benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence of a durable solve-verify asymmetry that survives policy improvement and on the staged validation pipeline successfully filtering for that property. No explicit free parameters, axioms, or invented entities are stated in the abstract.


Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, and Dacheng Tao. Serl: Self-play reinforcement learning for large language models with limited data, 2025

2025

[11] [11]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

2025

[12] [12]

How far can unsupervised rlvr scale llm training?, 2026

Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, et al. How far can unsupervised rlvr scale llm training?, 2026

2026

[13] [13]

V-star: Training verifiers for self-taught reasoners, 2024

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. V-star: Training verifiers for self-taught reasoners, 2024

2024

[14] [14]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-Zero: Self-evolving reasoning LLM from zero data, 2025. URLhttps://arxiv.org/abs/ 2508.05004

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. 2024. URLhttps://arxiv.org/abs/2403.07974

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Language self-play for data-free training, 2025

Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan, and Jason Chen. Language self-play for data-free training, 2025. URLhttps://arxiv.org/abs/2509.07414

work page arXiv 2025

[17] [17]

Opensir: Open-ended self-improving reasoner, 2025

Wai-Chung Kwan, Joshua Ong Jun Leang, Pavlos Vougiouklis, Jeff Z Pan, Marco Valentino, and Pasquale Minervini. Opensir: Open-ended self-improving reasoner, 2025

2025

[18] [18]

Embomatrix: A scalable training-ground for embodied decision-making,

Zixing Lei, Sheng Yin, Yichen Xiong, Yuanzhuo Ding, Wenhao Huang, Yuxi Wei, Qingyao Xu, Yiming Li, Weixin Li, Yunhong Wang, and Siheng Chen. Embomatrix: A scalable training-ground for embodied decision-making,

[19] [19]

URLhttps://arxiv.org/abs/2510.12072

work page arXiv

[20] [20]

Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning, 2025

Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, et al. Spiral: Self-play on zero-sum games incentivizes reasoning via multi-agent multi-turn reinforcement learning, 2025

2025

[21] [21]

Spice: Self-play in corpus environments improves reasoning, 2025

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning, 2025. 11

2025

[22] [22]

Chasing moving targets with online self-play reinforcement learning for safer language models, 2025

Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, and Natasha Jaques. Chasing moving targets with online self-play reinforcement learning for safer language models, 2025

2025

[23] [23]

Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models, 2025

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models, 2025

2025

[24] [24]

Search self-play: Pushing the frontier of agent capability without supervision, 2025

Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, and Guanjun Jiang. Search self-play: Pushing the frontier of agent capability without supervision, 2025

2025

[25] [25]

Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset

Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. 2025. URLhttps://arxiv.org/abs/2504.16891

work page arXiv 2025

[26] [26]

Self-consistency preference optimization, 2024

Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, and Jane Yu. Self-consistency preference optimization, 2024

2024

[27] [27]

Scaling synthetic task generation for agents via exploration, 2025

Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration, 2025. URL https://arxiv.org/abs/2509.25047

work page arXiv 2025

[28] [28]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. 2019. URL https://arxiv.org/abs/1908.10084

work page internal anchor Pith review Pith/arXiv arXiv 2019

[29] [29]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. 2023. URLhttps: //arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, et al. Seed1. 5-thinking: Advancing superb reasoning models with reinforcement learning.arXiv preprint arXiv:2504.13914, 2025

work page arXiv 2025

[31] [31]

Spurious rewards: Rethinking training signals in rlvr, 2025

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr, 2025

2025

[32] [32]

Beyond human data: Scaling self-training for problem-solving with language models, 2023

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models, 2023

2023

[33] [33]

Envscaler: Scaling tool-interactive environments for llm agent via programmatic synthesis, 2026

Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Envscaler: Scaling tool-interactive environments for llm agent via programmatic synthesis, 2026. URLhttps://arxiv.org/abs/2601. 05808

2026

[34] [34]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models. 2026. URLhttps://arxiv.org/abs/2512.13607

work page arXiv 2026

[36] [36]

Rui Wang, Joel Lehman, Jeff Clune, and Kenneth O. Stanley. Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions, 2019. URL https: //arxiv.org/abs/1901.01753

work page internal anchor Pith review Pith/arXiv arXiv 2019

[37] [37]

Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution, 2025

Shaobo Wang, Zhengbo Jiao, Zifan Zhang, Yilang Peng, Xu Ze, Boyu Yang, Wei Wang, Hu Wei, and Linfeng Zhang. Socratic-zero: Bootstrapping reasoning via data-free agent co-evolution, 2025

2025

[38] [38]

Llms as scalable, general-purpose simulators for evolving digital agent training,

Yiming Wang, Da Yin, Yuedong Cui, Ruichen Zheng, Zhiqian Li, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, and Kai-Wei Chang. Llms as scalable, general-purpose simulators for evolving digital agent training,

[39] [39]

URLhttps://arxiv.org/abs/2510.14969

work page arXiv

[40] [40]

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, and Sida Wang. Toward training superintelligent software agents through self-play swe-rl, 2025. URL https://arxiv.org/abs/2512.18552. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Mirage or method? how model-task alignment induces divergent rl conclusions, 2025

Haoze Wu, Cheng Wang, Wenshuo Zhao, and Junxian He. Mirage or method? how model-task alignment induces divergent rl conclusions, 2025. URLhttps://arxiv.org/abs/2508.21188

work page arXiv 2025

[42] [42]

Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026

Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296

work page arXiv 2026

[43] [43]

Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning, 2025

Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, and Huaxiu Yao. Agent0: Unleashing self-evolving agents from zero data via tool-integrated reasoning, 2025

2025

[44] [44]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025

2025

[45] [45]

Dapo: An open-source llm reinforcement learning system at scale, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025

2025

[46] [46]

Self-rewarding language models, 2024

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024

2024

[47] [47]

Star: Bootstrapping reasoning with reasoning, 2022

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning, 2022

2022

[48] [48]

GLM-5: from Vibe Coding to Agentic Engineering

Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [49]

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, and Hannaneh Hajishirzi. Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments, 2025. URL...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Darwin gödel machine: Open-ended evolution of self-improving agents, 2026

Jenny Zhang, Shengran Hu, Cong Lu, Robert Tjarko Lange, and Jeff Clune. Darwin gödel machine: Open-ended evolution of self-improving agents, 2026. URLhttps://openreview.net/forum?id=pUpzQZTvGY

2026

[51] [51]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization, 2025

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization, 2025

2025

[52] [52]

Better llm reasoning via dual-play, 2025

Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, and Claire Cardie. Better llm reasoning via dual-play, 2025

2025

[53] [53]

Infiniteweb: Scalable web environment synthesis for gui agent training, 2026

Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, and Yan Lu. Infiniteweb: Scalable web environment synthesis for gui agent training, 2026. URLhttps://arxiv.org/abs/2601.04126

work page internal anchor Pith review arXiv 2026

[54] [54]

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data, 2025. URL https://arxiv.org/abs/2505.03335

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Learning to reason without external rewards, 2025

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards, 2025

2025

[56] [56]

Self-challenging language model agents, 2026

Yifei Zhou, Sergey Levine, Jason E Weston, Xian Li, and Sainbayar Sukhbaatar. Self-challenging language model agents, 2026. URLhttps://openreview.net/forum?id=9yusqX9DpR

2026

[57] [57]

Evolving language models without labels: Majority drives selection, novelty promotes variation, 2025

Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, and Dong Yu. Evolving language models without labels: Majority drives selection, novelty promotes variation, 2025

2025

[58] [58]

Training versatile coding agents in synthetic environments, 2025

Yiqi Zhu, Apurva Gandhi, and Graham Neubig. Training versatile coding agents in synthetic environments, 2025. URLhttps://arxiv.org/abs/2512.12216

work page arXiv 2025

[59] [59]

Given the multiset {S}, find a nonempty

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning, 2025. 13 A Detailed positioning against nearby self-improvement methods Table 3 expands the sketch in Section 2 into a family-by-family comparison. Two axes distinguishEvoEnv from each adjacent...

2025