DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

Chuang Wang; Jian Su; Lirong Che; Peiwen Lin; Xueqian Wang; Yuzhe Yang

arxiv: 2605.24539 · v1 · pith:4NAAIVRQnew · submitted 2026-05-23 · 💻 cs.AI

Lirong Che, Yuzhe Yang, Peiwen Lin, Chuang Wang, Xueqian Wang, Jian Su

Pith reviewed 2026-06-30 13:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords harness evolution · demonstrations · sparse feedback · language model agents · long-horizon tasks · sample-efficient adaptation · agentic systems

The pith

Demonstrations bootstrap harness evolution to overcome sparse feedback in long-horizon agent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores evolving executable harnesses around frozen language models to acquire task competence without changing weights. In long-horizon stochastic environments, self-generated rollouts struggle with sparse rewards and hard-to-attribute failures. DemoEvolve uses competent human trajectories as reference experience to guide the coding proposer in diagnosing and editing the harness. This method yields more effective and auditable edits than self-rollout or tutorial approaches alone, as tested in games like Liar's Dice and Balatro under limited budgets.

Core claim

DemoEvolve is a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments show that in short-episode settings self-rollout works, but in harder regimes like Balatro, self-rollout is misled by sparse feedback while DemoEvolve produces more effective and auditable harness edits and better performance under the same limited budget.

What carries the argument

Demonstration-bootstrapped harness evolution, using human trajectories to guide the coding proposer for harness diagnosis and editing.
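As a rough illustration of the loop structure this describes, the sketch below evolves a toy harness around a frozen policy. Everything in it is an assumption made for illustration: the `Harness` dataclass, the `propose_edit` stub standing in for the coding proposer, the `evaluate` function, and the toy scoring rule are hypothetical and are not taken from the paper. It shows only where demonstrations would enter the loop and does not reproduce any reported result.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Harness:
    # Hypothetical harness: one tunable knob standing in for executable structure.
    risk_threshold: float


def evaluate(harness: Harness, seeds: Sequence[int]) -> float:
    """Toy stand-in for running the frozen model inside the harness on fixed seeds."""
    total = 0.0
    for seed in seeds:
        noise = random.Random(seed).gauss(0.0, 0.05)  # seed-dependent outcome noise
        # Arbitrary toy objective: the score peaks at risk_threshold == 0.7.
        total += 1.0 - abs(harness.risk_threshold - 0.7) + noise
    return total / len(seeds)


def propose_edit(
    harness: Harness,
    archive: List[Tuple[Harness, float]],
    demos: Optional[Sequence[float]],
    rng: random.Random,
) -> Harness:
    """Stub for the coding proposer (a real one would read the archive and edit code).

    With demonstrations, the edit moves halfway toward the mean knob value
    observed in the (hypothetical) human trajectories. Without them, the edit
    is an undirected random perturbation.
    """
    if demos:
        target = sum(demos) / len(demos)
        new_value = harness.risk_threshold + 0.5 * (target - harness.risk_threshold)
    else:
        new_value = harness.risk_threshold + rng.uniform(-0.2, 0.2)
    return replace(harness, risk_threshold=min(1.0, max(0.0, new_value)))


def evolve(
    base: Harness,
    seeds: Sequence[int],
    iterations: int,
    demos: Optional[Sequence[float]] = None,
    rng_seed: int = 0,
) -> Tuple[Harness, float]:
    """Greedy propose, evaluate, keep-if-better loop over harness candidates."""
    rng = random.Random(rng_seed)
    best, best_score = base, evaluate(base, seeds)
    archive = [(base, best_score)]
    for _ in range(iterations):
        candidate = propose_edit(best, archive, demos, rng)
        score = evaluate(candidate, seeds)
        archive.append((candidate, score))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Calling `evolve(Harness(0.1), seeds=[1, 2, 3], iterations=5, demos=[0.7, 0.65, 0.75])` runs the demonstration-guided variant; omitting `demos` gives the self-rollout-style variant, with the iteration count held equal.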

If this is right

Self-rollout evolution succeeds only when episodes are short and failures attributable.
Tutorial-like textual knowledge alone fails to yield stable improvement in stochastic regimes.
DemoEvolve makes harness evolution more diagnosable, localizable, and stable in sparse-feedback settings.
Under limited budget, it achieves better agent performance in long-horizon stochastic environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Human demonstrations could serve as a general bootstrap for agent adaptation in other domains with sparse signals.
Integrating this with model fine-tuning might amplify gains in sample efficiency.
The approach suggests a path for making AI agents more auditable by relying on explicit harness structures guided by examples.

Load-bearing premise

Competent human trajectories provide sufficiently diagnostic and generalizable reference experience that allows the coding proposer to localize and repair harness mechanisms better than self-generated rollouts can.

What would settle it

A controlled comparison in Balatro where DemoEvolve shows no performance gain over self-rollout evolution under identical limited budgets and the same number of edits.
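One way such a comparison could be scored, assuming per-seed results are available for both conditions on the same seeds, is a paired sign-flip permutation test. The function name and the choice of test are illustrative assumptions, not the paper's evaluation protocol.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    rng_seed: int = 0,
) -> float:
    """One-sided sign-flip permutation test for mean(a - b) > 0 on paired scores.

    a[i] and b[i] must come from the same seed under the two conditions.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need equal-length, non-empty paired samples")
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(rng_seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

A large p-value on matched seeds and matched budgets would be the "no performance gain" outcome described above; a small one would count against it.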

Figures

Figures reproduced from arXiv: 2605.24539 by Chuang Wang, Jian Su, Lirong Che, Peiwen Lin, Xueqian Wang, Yuzhe Yang.

**Figure 1.** Games as a controlled testbed for task-world adaptation. We position domains by task-world dynamics and adaptation requirement. Turn-based games introduce action-conditioned, seed-dependent futures and sparse long-horizon feedback, but remain reproducible through fixed seeds and explicit action interfaces. They therefore test whether harness evolution can move beyond model-native priors toward task-specifi…

**Figure 2.** DemoEvolve information regimes. A coding proposer evolves a task harness around a frozen model using a filesystem archive of prior rollout trajectories, raw execution traces, and scores. We vary only the proposer-visible information: self-rollout archive, archive plus external textual information, or archive plus human trajectories. Meta-Harness denotes the self-rollout evolution condition. In this regime,…

**Figure 3.** Held-out TextArena performance after development-only harness selection. Each bar uses …

**Figure 4.** Attrition-adjusted shop-entry economy in Balatro. Dead rollouts contribute zero at later …

**Figure 5.** Development/search progress for TextArena Liar’s Dice. Iteration 0 is the base harness.

**Figure 6.** Annotated Balatro screenshots illustrating the mechanics used in our analysis.
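The Figure 4 caption is truncated here, so the following is only one plausible reading of "attrition-adjusted": the per-stage mean is taken over every rollout that was started, and a rollout that ended before a stage contributes zero to it. The function names and the list-of-lists data layout are assumptions for illustration.

```python
from typing import List, Optional, Sequence


def attrition_adjusted_mean(
    rollouts: Sequence[Sequence[float]], n_stages: int
) -> List[float]:
    """Per-stage mean over ALL rollouts; rollouts that died earlier count as 0."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    means = []
    for stage in range(n_stages):
        total = sum(r[stage] for r in rollouts if stage < len(r))
        means.append(total / len(rollouts))
    return means


def survivor_mean(
    rollouts: Sequence[Sequence[float]], n_stages: int
) -> List[Optional[float]]:
    """Naive per-stage mean over surviving rollouts only (None if none survive)."""
    means: List[Optional[float]] = []
    for stage in range(n_stages):
        alive = [r[stage] for r in rollouts if stage < len(r)]
        means.append(sum(alive) / len(alive) if alive else None)
    return means
```

The survivor-only mean can rise at later stages simply because weaker rollouts have dropped out; dividing by the full rollout count removes that selection effect.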

Original abstract

Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DemoEvolve uses human demos to stabilize harness evolution where self-rollouts fail on sparse long-horizon tasks, but the same-budget claim needs checking against demo collection costs.

The core idea is straightforward: when self-generated rollouts give noisy or unattributable signals in games like Balatro, feed competent human trajectories to the proposer so it can localize harness problems and propose better edits. The abstract contrasts this with shorter episodes in Liar's Dice where self-rollout already works, and with pure textual tutorials that don't stabilize results. That framing of the failure mode and the targeted use of demonstrations is the actual new piece.

The paper does a clean job laying out why reward-only search becomes fragile in high-variance stochastic settings and why an external reference signal helps diagnosis. The two-game setup gives a minimal but useful contrast between regimes.

The stress-test concern lands. The strongest claim is that DemoEvolve wins under the same limited budget, yet the method's advantage rests on human trajectories that the baselines do not receive. If the budget metric only tracks LLM calls or iterations, the comparison is asymmetric and the gains could trace to the extra reference data rather than the evolution loop itself. Without seeing the experimental section it is impossible to tell whether they measured or controlled for human effort, how the trajectories were selected, or whether results are sensitive to particular demos.

This is niche work for the agent scaffolding and harness-evolution crowd. Readers already tracking self-practice methods will see the practical angle; others will find the scope narrow. The thinking is coherent and directly engages the stated limitation in prior work, so the paper is worth a referee's time to check the numbers, ablations, and cost accounting.

Referee Report

2 major / 2 minor

Summary. The paper introduces DemoEvolve, a demonstration-bootstrapped method for evolving executable harnesses around frozen language-model agents. It argues that self-generated rollouts become fragile in long-horizon stochastic settings with sparse rewards and high-variance outcomes, while tutorial-style textual knowledge alone yields unstable gains; competent human trajectories are instead used as reference experience to guide the coding proposer in localizing and repairing harness mechanisms. Experiments contrast short-episode Liar's Dice (where self-rollout succeeds) with the harder Balatro regime, claiming that DemoEvolve produces more effective, auditable edits and superior performance under the same limited budget.

Significance. If the empirical results hold after addressing budget accounting, the work would demonstrate a practical route to sample-efficient, weight-preserving adaptation for LM agents in regimes where pure self-practice fails, with the added benefit of more diagnosable and auditable harness changes. The explicit regime contrast between the two environments is a strength that helps isolate when demonstrations are necessary.

major comments (2)

[Abstract] Abstract: the central claim of superior performance 'under the same limited budget' is load-bearing for the contribution, yet the budget metric is undefined. If the cost of collecting competent human trajectories is excluded while counting only LLM calls or proposal iterations, the comparison to self-rollout evolution is asymmetric and the reported gains cannot be attributed solely to the demonstration mechanism.
[Experiments] Experiments section (Liar's Dice vs. Balatro comparison): without reported ablations that isolate the contribution of human trajectories from other factors (e.g., proposal prompting style or candidate selection heuristics), it remains unclear whether the performance edge stems from the reference experience itself or from incidental differences in search procedure.
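The budget asymmetry raised in the first comment can be made concrete with a small ledger that records search cost and demonstration cost separately. The field names and the minutes-based unit for human effort are hypothetical; the paper's actual budget metric is, as noted, undefined in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetLedger:
    llm_calls: int
    evolution_iterations: int
    # Hypothetical unit for human effort; zero for conditions without demos.
    demo_collection_minutes: float = 0.0

    def search_budget_matches(self, other: "BudgetLedger") -> bool:
        """True when only the LLM-side search budget is compared."""
        return (
            self.llm_calls == other.llm_calls
            and self.evolution_iterations == other.evolution_iterations
        )

    def fully_matches(self, other: "BudgetLedger") -> bool:
        """True only when demonstration cost is matched as well."""
        return (
            self.search_budget_matches(other)
            and self.demo_collection_minutes == other.demo_collection_minutes
        )
```

Two conditions can satisfy `search_budget_matches` while failing `fully_matches`, which is exactly the situation in which a "same budget" claim needs the extra cost reported alongside it.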

minor comments (2)

[Abstract] The abstract uses 'auditable harness edits' without defining the auditability criterion or providing examples of what makes an edit more or less auditable than a self-rollout edit.
Terminology: 'harness evolution' and 'coding proposer' are introduced without a brief formalization or pseudocode sketch, which would help readers map the method to prior harness or tool-use literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important points about experimental clarity and fair comparison that we will address in revision. We respond to each major comment below.

Referee: [Abstract] Abstract: the central claim of superior performance 'under the same limited budget' is load-bearing for the contribution, yet the budget metric is undefined. If the cost of collecting competent human trajectories is excluded while counting only LLM calls or proposal iterations, the comparison to self-rollout evolution is asymmetric and the reported gains cannot be attributed solely to the demonstration mechanism.

Authors: We agree that the budget accounting must be defined explicitly for the claim to be interpretable. In the current manuscript the 'limited budget' is intended to denote the number of LLM calls and evolution iterations allocated to the proposer and evaluator, with human demonstrations treated as a fixed, one-time reference input rather than part of the per-run search budget. Nevertheless, the manuscript does not state this distinction clearly. We will revise the abstract, introduction, and experimental setup to provide a precise budget definition together with a cost breakdown that separates demonstration collection from LLM-based search steps. This change will allow readers to evaluate the asymmetry directly.
revision: yes
Referee: [Experiments] Experiments section (Liar's Dice vs. Balatro comparison): without reported ablations that isolate the contribution of human trajectories from other factors (e.g., proposal prompting style or candidate selection heuristics), it remains unclear whether the performance edge stems from the reference experience itself or from incidental differences in search procedure.

Authors: The existing regime contrast (short-episode Liar's Dice where self-rollout succeeds versus long-horizon Balatro where it fails) is designed to isolate the value of demonstrations under sparse feedback. However, we acknowledge that this does not fully disentangle the effect of the human trajectories from possible differences in prompting templates or selection heuristics between the self-rollout and DemoEvolve conditions. We will add targeted ablations in the revised experiments section that hold prompting style and selection procedure fixed while varying only the presence and quality of reference trajectories. These results will be reported alongside the existing comparisons.
revision: yes
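The ablation design promised in this response can be written down as a run grid in which only the trajectory condition and the seed vary. The condition labels, the default template and selection-rule names, and the dictionary layout below are hypothetical placeholders, not settings from the paper.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

RunConfig = Dict[str, Union[str, int]]


def ablation_grid(
    trajectory_conditions: Sequence[str],
    seeds: Sequence[int],
    prompting_style: str = "shared-template",  # held fixed across all runs
    selection_rule: str = "dev-mean",          # held fixed across all runs
) -> List[RunConfig]:
    """Enumerate runs that differ only in reference trajectories and seed."""
    return [
        {
            "trajectories": condition,
            "seed": seed,
            "prompting_style": prompting_style,
            "selection_rule": selection_rule,
        }
        for condition, seed in product(trajectory_conditions, seeds)
    ]
```

For example, `ablation_grid(["none", "low-quality", "competent"], seeds=[0, 1])` yields six runs that share one prompting style and one selection rule, so any difference between conditions can be attributed to the trajectories or to seed noise.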

Circularity Check

0 steps flagged

No circularity: empirical comparison with independent experimental support

The paper advances an empirical method (DemoEvolve) for harness evolution that incorporates human demonstrations to address sparse rewards in long-horizon settings. Its central claims rest on experimental outcomes comparing DemoEvolve against self-rollout and tutorial baselines on Liar's Dice and Balatro, under a stated budget. No equations, fitted parameters, or first-principles derivations appear in the abstract or described structure; the performance advantage is presented as an observed result rather than a quantity forced by construction from the inputs. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.1-grok · 5784 in / 1061 out tokens · 26870 ms · 2026-06-30T13:10:09.720731+00:00 · methodology

[1] [1]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th Annual International Conference on Machine Learning, pages 41–48. Association for Computing Machinery, 2009

2009

[2] [2]

Balatrobench.https://balatrobench.com/, 2026

Coder. Balatrobench.https://balatrobench.com/, 2026. Accessed May 2, 2026

2026

[3] [3]

BalatroLLM: Play Balatro with LLMs

Coder. BalatroLLM: Play Balatro with LLMs. https://github.com/coder/balatrollm,

[4] [4]

Accessed May 2, 2026

2026

[5] [5]

Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations, 2024

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations, 2024. URL https://arxiv. org/abs/2402.12348

work page arXiv 2024

[6] [6]

Stanley, and Jeff Clune

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. First return, then explore.Nature, 590(7847):580–586, 2021

2021

[7] [7]

Reverse curriculum generation for reinforcement learning

Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. InConference on Robot Learning, pages 482–495, 2017

2017

[8] [8]

Catarena: Evaluating evolutionary capabilities of code agents via iterative tournaments, 2025

Lingyue Fu, Xin Ding, Linyue Pan, Yaoming Zhu, Shao Zhang, Lin Qiu, Weiwen Liu, Weinan Zhang, Xuezhi Cao, Xunliang Cai, Jiaxin Ding, and Yong Yu. Catarena: Evaluating evolutionary capabilities of code agents via iterative tournaments, 2025. URL https://arxiv.org/abs/ 2510.26852

work page arXiv 2025

[9] [9]

Textarena,

Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, and Cheston Tan. Textarena,

[10] [10]

URLhttps://arxiv.org/abs/2504.11442

work page arXiv

[11] [11]

Deep q-learning from demonstrations

Todd Hester, Mel Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Jean-Baptiste Lespiau, Laurent Sartran, and Guillaume Beaudoin. Deep q-learning from demonstrations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018

2018

[12] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 2016.

[13] Lanxiang Hu, Qiyu Li, Anze Xie, Nan Jiang, Ion Stoica, Haojian Jin, and Hao Zhang. Gamearena: Evaluating llm reasoning through live computer games, 2024. URL https://arxiv.org/abs/2412.06394.

[14] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=t9U3LW7JVX.

[15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=VTF8yNQM66.

[16] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. Advances in Neural Information Processing Systems, 23, 2010.

[17] Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses, March 2026. URL http://arxiv.org/abs/2603.28052.

[18] Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, and Bo Jin. Textatari: 100k frames game playing with language agents, 2025. URL https://arxiv.org/abs/2506.04098.

[19] Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, and Tao Gui. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses, April 2026. URL http://arxiv.org/abs/2604.25850.

[20] Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P. Murphy. Autoharness: improving llm agents by automatically synthesizing a code harness, 2026. URL https://arxiv.org/abs/2603.03329.

[21] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, April 2026. URL http://arxiv.org/abs/2604.08377.

[22] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-Seventh Conference on Neural Inform... 2023

[23] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An... Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces, 2026.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.

[25] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299, 2018.

[26] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor... 2025

[27] Davide Paglieri, Bartlomiej Cupial, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktaschel. Balrog: Benchmarking agentic llm and vlm reasoning on games, 2024. URL https://arxiv.org/abs/2411.13543.

[28] Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, and Jaewoong Cho. Orak: A foundational benchmark for training and evaluating llm agents on diverse video games, 2025. URL https://arxiv.org/...

[29] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Robotics: Science and Systems, 2018.

[30] Maxime Robeyns, Martin Szummer, and Laurence Aitchison. A self-improving coding agent, 2025. URL https://arxiv.org/abs/2504.15228.

[32] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[33] Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent. https://github.com/algorithmicsuperintelligence/openevolve, 2025. URL https://github.com/algorithmicsuperintelligence/openevolve. GitHub repository.

[34] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=vAElhFcKW6.

[35] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

[36] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, October 2023. URL http://arxiv.org/abs/2305.16291.

[37] Wenyi Wang, Piotr Piekos, Nanbo Li, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jurgen Schmidhuber. Huxley-godel machine: Human-level coding agent development by an approximation of the optimal self-improving machine, 2025. URL https://arxiv.org/abs/2510.21614.

[38] Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. Live-swe-agent: Can software engineering agents self-evolve on the fly?, 2025. URL https://arxiv.org/abs/2511.13646.

[39] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, February 2026. URL http://arxiv.org/abs/2602.08234.

[40] Yiming Xiong, Shengran Hu, and Jeff Clune. Learning to continually learn via meta-learning agentic memory designs. In OpenReview, 2026. URL https://api.semanticscholar.org/CorpusID:285454009.

[41] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WE_vluYUL-X.

[42] Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, and Guojie Song. Meta context engineering via agentic skill evolution. arXiv preprint arXiv:2601.21557, 2026.

[43] Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025.

[44] Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents, 2026. URL https://arxiv.org/abs/2505.22954.

[45] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=z5u...

[46] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations, October 2025. URL h...

[47] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners, December 2024. URL http://arxiv.org/abs/2308.10144.

A TextArena Liar’s Dice Details

This appendix gives additional details for the TextArena Liar’s Dice experiments reported in Section 4.1. These experiments serve as a light...