
arxiv: 2605.14392 · v1 · pith:2FL56AAQnew · submitted 2026-05-14 · 💻 cs.AI

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Pith reviewed 2026-06-30 20:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords self-improving language models · reasoning RL · environment synthesis · solve-verify asymmetry · verifiable environments · zero-data RL · EvoEnv · Python environments

The pith

Self-improving reasoning RL succeeds when models synthesize their own environments that maintain stable solve-verify asymmetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that zero-data reasoning reinforcement learning improves when models shift from generating problems or traces to constructing reusable executable environments that sample instances, compute references, and score responses. These environments must preserve a durable gap where the model can write a verifier it cannot reliably execute on fresh instances in natural language. Fixed public data or hand-crafted environments lose their signal and can reduce performance as the model strengthens, while the proposed method avoids this by enforcing the asymmetry through algorithmic hardness or intrinsic verification ease. EvoEnv realizes the loop with a single policy that generates Python environments from seeds and filters them via staged validation, self-review, difficulty calibration, and novelty checks. On Qwen3-4B-Thinking the approach raises average performance from 72.4 to 74.8 while the static baselines lower it.
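As a rough illustration of what "a reusable executable object that samples instances, computes references, and scores responses" could look like, here is a minimal sketch. The class name `LISEnvironment`, its three method names, and the choice of task (length of the longest strictly increasing subsequence) are assumptions made for this example; the source does not show the interface EvoEnv actually uses.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class LISEnvironment:
    """Hypothetical environment: length of the longest strictly increasing subsequence."""
    length: int = 12      # instance size; an assumed difficulty knob
    max_value: int = 50   # upper bound on sampled values (assumed)

    def sample(self, rng: random.Random) -> List[int]:
        """Draw a fresh instance."""
        return [rng.randint(0, self.max_value) for _ in range(self.length)]

    def reference(self, instance: List[int]) -> int:
        """Compute the reference answer with an O(n^2) dynamic program."""
        best = [1] * len(instance)
        for i in range(len(instance)):
            for j in range(i):
                if instance[j] < instance[i]:
                    best[i] = max(best[i], best[j] + 1)
        return max(best, default=0)

    def score(self, instance: List[int], response: str) -> float:
        """Return 1.0 if the response parses to the reference answer, else 0.0."""
        try:
            answer = int(response.strip())
        except ValueError:
            return 0.0
        return 1.0 if answer == self.reference(instance) else 0.0
```

The dynamic program is a few lines of code, yet tracing it step by step in natural language on a fresh twelve-element list is error-prone, which is the kind of gap the paper calls solve-verify asymmetry.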

Core claim

The central claim is that stable self-improvement arises from an environment-construction loop in which each artifact is a reusable executable object that samples instances, computes references, and scores responses, provided the environments exhibit stable solve-verify asymmetry. This asymmetry takes two forms: tasks algorithmically hard to reason through but trivial as code, or tasks intrinsically hard to solve but easy to verify. Both keep the reward informative because the policy cannot close the gap by gaming the verifier. EvoEnv implements the loop by synthesizing Python environments from ten seeds and admitting them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks.

What carries the argument

Stable solve-verify asymmetry: the durable gap in which the model can write an oracle once that it cannot reliably execute in natural language on fresh instances, keeping the reward signal informative as the solver improves.
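The second form of asymmetry (hard to solve, easy to verify) can be made concrete with planted subset-sum, which the abstract names as an example. The sketch below is an assumption-laden illustration: the function names, the default sizes, and the choice to return only the numbers and the target are all hypothetical, not taken from the paper.

```python
import random
from typing import List, Tuple


def sample_planted_subset_sum(rng: random.Random, n: int = 20,
                              max_value: int = 10**6) -> Tuple[List[int], int]:
    """Draw n numbers and a target equal to the sum of a hidden nonempty subset."""
    numbers = [rng.randint(1, max_value) for _ in range(n)]
    k = rng.randint(1, n)                # size of the planted subset
    planted = rng.sample(range(n), k)    # hidden from the solver
    target = sum(numbers[i] for i in planted)
    return numbers, target


def verify_subset(numbers: List[int], target: int, indices: List[int]) -> bool:
    """Check a proposed certificate in time linear in its length."""
    if not indices or len(set(indices)) != len(indices):
        return False  # reject empty answers and repeated indices
    if any(i < 0 or i >= len(numbers) for i in indices):
        return False  # reject out-of-range indices
    return sum(numbers[i] for i in indices) == target
```

Planting the subset guarantees that every instance has a solution, and the verifier never needs to search: it only adds up the indices the solver proposes.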

If this is right

  • On already strong models, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce average performance while environment synthesis raises it.
  • Self-improvement requires environments whose difficulty stays structurally beyond the model's reach rather than more synthetic data.
  • A single policy can serve as both generator and solver when environments are admitted only after validation and calibration.
  • Two complementary environment types sustain the asymmetry: algorithmically hard but trivial as code, and intrinsically hard to solve but easy to verify.
  • The reward signal stays useful only while the generator continues to produce environments that the current solver cannot reliably handle in natural language.
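One way to read "admitted only after validation and calibration" is as a pass-rate band: an environment the current solver always or never passes gives no contrast between responses. The sketch below assumes that reading. The band limits `low` and `high` and both function names are hypothetical; the source does not state the calibration rule or its thresholds.

```python
from typing import List


def solver_pass_rate(scores: List[float]) -> float:
    """Mean score of the current solver over sampled instances (0.0 if none)."""
    return sum(scores) / len(scores) if scores else 0.0


def admit(scores: List[float], low: float = 0.2, high: float = 0.8) -> bool:
    """Admit an environment only if the solver's pass rate lies in [low, high].

    The band limits are illustrative assumptions, not values from the paper.
    """
    rate = solver_pass_rate(scores)
    return low <= rate <= high
```

Because the band is defined relative to the current solver, an environment admitted in one round may be rejected in a later round once the solver improves.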

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same asymmetry principle could be tested in non-reasoning domains such as code generation if analogous oracle-solver gaps can be engineered.
  • Running the loop for more iterations would show whether the generator itself improves at creating harder environments or whether asymmetry eventually saturates.
  • Applying the method to larger base models might produce larger absolute gains if the generator scales in its ability to maintain the gap.
  • The approach suggests that future self-improvement systems should prioritize verifiable environment construction over pure data synthesis.

Load-bearing premise

The environments the generator produces will keep showing stable solve-verify asymmetry on new instances even after the solver policy strengthens, so the reward remains informative instead of being gamed or saturated.

What would settle it

After multiple training rounds with EvoEnv, measure whether the solver policy now solves the generated environments reliably in natural language without needing the code oracle; if it does, the asymmetry has collapsed and further improvement should cease.
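The test described above can be phrased as a simple check on per-round pass rates measured on freshly generated environments. This is a sketch under stated assumptions: the function name, the `threshold` of 0.95, and the `patience` of two rounds are hypothetical choices for illustration, and the source does not define what counts as "reliably".

```python
from typing import List


def asymmetry_collapsed(pass_rates_by_round: List[float],
                        threshold: float = 0.95, patience: int = 2) -> bool:
    """Return True if the solver's pass rate on fresh environments has stayed
    at or above `threshold` for the last `patience` rounds.

    Both parameters are illustrative assumptions, not values from the paper.
    """
    if len(pass_rates_by_round) < patience:
        return False
    return all(rate >= threshold for rate in pass_rates_by_round[-patience:])
```

Requiring several consecutive rounds above the threshold avoids declaring collapse on a single lucky batch of easy environments.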

Original abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry; that is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
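For readers checking the headline figure, the 3.3% relative gain follows directly from the two averages quoted in the abstract:

```latex
\[
\frac{74.8 - 72.4}{72.4} \;=\; \frac{2.4}{72.4} \;\approx\; 0.0331 \;\approx\; 3.3\%
\]
```

The absolute gain is therefore 2.4 points on the reported average.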

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes EvoEnv, a single-policy generator-solver framework in which an LLM synthesizes reusable Python environments from ten seeds; environments are admitted only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The central claim is that these environments maintain stable solve-verify asymmetry (algorithmically hard to reason through but trivial as code, or intrinsically hard to solve but easy to verify), enabling durable self-improving reasoning RL. On Qwen3-4B-Thinking this yields an average performance increase from 72.4 to 74.8 (3.3% relative), while fixed public-data RLVR and fixed hand-crafted environment RLVR both decrease performance.
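The supplied text does not say how the novelty check is computed. Purely to illustrate the idea of rejecting near-duplicate environments, the sketch below substitutes token-level Jaccard similarity between task descriptions. The function names and the `max_similarity` cutoff of 0.7 are assumptions for this example, not the paper's method.

```python
from typing import List, Set


def token_set(description: str) -> Set[str]:
    """Lowercase, whitespace-split token set of a task description."""
    return set(description.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_novel(candidate: str, admitted: List[str],
             max_similarity: float = 0.7) -> bool:
    """True if the candidate is not too similar to any admitted description.

    Jaccard similarity and the 0.7 cutoff are stand-ins chosen for illustration.
    """
    cand = token_set(candidate)
    return all(jaccard(cand, token_set(d)) <= max_similarity for d in admitted)
```

A lexical measure like this is cheap but blind to paraphrase, so a real pipeline would likely need a semantic comparison as well.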

Significance. If the durability of the solve-verify gap is demonstrated, the work offers a principled alternative to data-imitation loops for self-improvement, shifting focus to construction of executable environments whose difficulty remains structurally beyond the solver's reach. The distinction between algorithmic and intrinsic asymmetry forms is a clear conceptual contribution.

major comments (2)
  1. [Abstract] Abstract: the reported gain (72.4 to 74.8) is presented without any mention of number of runs, standard deviation, statistical significance, or controls for selection bias introduced by the solver-relative difficulty calibration step; this makes it impossible to judge whether the 3.3% improvement is robust or partly an artifact of the admission filter.
  2. [Abstract] Abstract (and implied results): no post-training measurement is described that checks whether the final solver still exhibits a stable solve-verify gap on newly sampled environments; without this, the claim that the asymmetry 'stays structurally beyond their own reach' remains unverified and the self-evolution loop could saturate after the first iteration.
minor comments (1)
  1. [Abstract] The abstract refers to 'ten seeds' and 'staged validation' but does not list the concrete benchmarks or task families used to compute the reported average; adding this would improve reproducibility.
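The first major comment asks for the number of runs, standard deviation, and significance. A minimal sketch of that reporting, using only the Python standard library, is given below. The function names are hypothetical, and Welch's t statistic is one reasonable choice among several; nothing here reflects an analysis the authors have performed.

```python
import math
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a: List[float], b: List[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error
```

As a toy usage example with made-up inputs, `welch_t([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])` compares two three-run samples; converting the statistic to a p-value additionally requires the Welch-Satterthwaite degrees of freedom and a t distribution, which the standard library does not provide.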

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater statistical transparency in the abstract and for a direct verification of persistent solve-verify asymmetry. Both comments identify genuine gaps in the current presentation. We address each below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported gain (72.4 to 74.8) is presented without any mention of number of runs, standard deviation, statistical significance, or controls for selection bias introduced by the solver-relative difficulty calibration step; this makes it impossible to judge whether the 3.3% improvement is robust or partly an artifact of the admission filter.

    Authors: We agree that the abstract lacks the requested statistical details and that this omission limits assessment of robustness. The solver-relative calibration is an integral part of the admission process, and while the fixed-baseline comparisons provide some control, explicit reporting is needed. In the revised manuscript we will update the abstract to state the number of independent runs, report standard deviation, note statistical significance where applicable, and briefly describe how selection bias is mitigated by the overall experimental design. revision: yes

  2. Referee: [Abstract] Abstract (and implied results): no post-training measurement is described that checks whether the final solver still exhibits a stable solve-verify gap on newly sampled environments; without this, the claim that the asymmetry 'stays structurally beyond their own reach' remains unverified and the self-evolution loop could saturate after the first iteration.

    Authors: The referee correctly notes that the manuscript does not include an explicit post-training evaluation of the solve-verify gap on newly generated environments. The observed performance lift provides supporting evidence, yet a direct measurement would more rigorously substantiate durability. We will add this analysis in the revision by evaluating the trained solver on a fresh set of environments sampled after the final iteration and reporting the resulting solve-verify gap. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's central claim is an empirical result: EvoEnv training yields a 72.4 to 74.8 gain on Qwen3-4B-Thinking while fixed baselines degrade performance. The derivation chain consists of an environment-generation procedure (ten seeds, staged validation, semantic self-review, solver-relative difficulty calibration, novelty checks) followed by RL training and external evaluation. No equations, fitted parameters renamed as predictions, or self-citations appear in the supplied text. The calibration step selects training environments relative to the current solver, but the reported metric is performance on (presumably held-out) benchmarks, not a quantity forced by the admission filter itself. The stability of solve-verify asymmetry is presented as a necessary assumption rather than a result derived from the inputs by construction. This is therefore a standard empirical claim whose validity can be checked against external benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence of a durable solve-verify asymmetry that survives policy improvement and on the staged validation pipeline successfully filtering for that property. No explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5864 in / 1405 out tokens · 20429 ms · 2026-06-30T20:55:45.880831+00:00 · methodology

