arxiv: 2605.13821 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.LG

Recognition: unknown

Harnessing Agentic Evolution

Jiayi Zhang , Yongfeng Gu , Jianhao Ruan , Maojia Song , Yiran Peng , Zhiguang Han , Jinyu Xiang , Zhitao Wang

show 5 more authors

Caiyin Yang Yixi Ouyang Bang Liu Chenglin Wu Yuyu Luo

Authors on Pith no claims yet

Pith reviewed 2026-05-14 17:54 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords agentic evolutionmeta-agentevolution optimizationAI agentsprocedure editinglong-horizon searchbenchmark performance

0 comments

The pith

AEvo improves agentic evolution by having a meta-agent edit the search procedure or context using accumulated evidence as state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats agentic evolution as an interactive environment whose state consists of all prior candidates, feedback, traces, and failures. A meta-agent then acts on this state not by outputting the next candidate but by revising the procedure or agent instructions that will generate future candidates. This meta-editing interface is meant to turn the growing body of evidence into a stable lever for steering both hand-designed and fully agentic search processes. The authors show that the resulting AEvo system beats five standard evolution baselines by 26 percent relative improvement on agentic and reasoning benchmarks and reaches state-of-the-art results on three open-ended optimization tasks under a fixed iteration budget.

Core claim

We formulate agentic evolution as an interactive environment whose process-level state is the accumulated evolution context, then introduce AEvo, a harnessed meta-editing framework in which a meta-agent observes this state and edits the procedure or agent context that controls future evolution rather than directly proposing the next candidate.

What carries the argument

AEvo meta-editing framework: a meta-agent that reads the full evolution context and revises the controlling procedure or agent instructions instead of generating solution candidates.

If this is right

AEvo outperforms five evolution baselines by a 26 percent relative margin on agentic and reasoning benchmarks.
On three open-ended optimization tasks it beats four evolution baselines and reaches state-of-the-art performance under the same iteration budget.
The same meta-editing interface works for both rigid procedure-based and flexible agent-based evolution methods.
Accumulated evidence becomes directly actionable for revising the mechanism that drives future search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The meta-editing pattern could be applied to other iterative search loops such as automated machine-learning pipelines to reduce the need for hand-tuned update rules.
If the meta-agent can discover entirely new editing operations, the method might generate evolution strategies that were not present in the original design space.
Testing the framework on tasks with much longer horizons would reveal whether context editing scales without eventual loss of coherence.

Load-bearing premise

Editing the procedure or agent context through the meta-agent will steer long-horizon evolution reliably without introducing new drift or instability, and the accumulated context supplies enough signal for effective edits.

What would settle it

A run in which AEvo shows no improvement or becomes less stable than the strongest baseline after several hundred iterations on a long-horizon task would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.13821 by Bang Liu, Caiyin Yang, Chenglin Wu, Jianhao Ruan, Jiayi Zhang, Jinyu Xiang, Maojia Song, Yiran Peng, Yixi Ouyang, Yongfeng Gu, Yuyu Luo, Zhiguang Han, Zhitao Wang.

**Figure 1.** Figure 1: Harnessing agentic evolution as an interactive environment. (a) Procedure-based evolution runs a fixed loop for selection, optimization, evaluation, and update. (b) Agent-based evolution lets a general-purpose agent manage search through feedback, tools, skills, and code actions. (c) AEVO treats the evolution process as an interactive environment. The accumulated evolution context becomes process-level sta… view at source ↗

**Figure 2.** Figure 2: Architecture of AEVO. The harness runs evolution segments under the current mechanism Πr, protects evaluation, and records structured evidence. A meta-agent observes this evidence to edit Πr into Πr+1 and set the next run plan, enabling coarse-grained intervention over both procedures and agent contexts. 4.1 Design of AEVO Meta-editing phase. The meta-editing phase decides both what to change and how to co… view at source ↗

**Figure 3.** Figure 3: Evolution trajectories on the Kernel optimization task. The left panel compares eight methods over the first 100 iterations, where blue curves denote AEVO variants. The y-axis reports the normalized score induced by cycle reduction, so higher is better; raw iterations and invalid evaluations are shown as scattered markers. The right panel extends the AEVO run from 100 to 200 iterations and reports raw cycl… view at source ↗

**Figure 4.** Figure 4: Case study of procedure evolution on an ARC-AGI-2 task. Each [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

read the original abstract

Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but can drift in long-horizon evolution. Both forms accumulate rich evidence over time, including candidates, feedback, traces, and failures, yet lack a stable interface for organizing this evidence and revising the mechanism that drives future evolution. We address this limitation by formulating agentic evolution as an interactive environment, where the accumulated evolution context serves as a process-level state. We introduce AEvo, a harnessed meta-editing framework in which a meta-agent observes this state and acts not by directly proposing the next candidate, but by editing the procedure or agent context that controls future evolution. This unified interface enables AEvo to steer both procedure-based and agent-based evolution, making accumulated evidence actionable for long-horizon search. Empirical evaluations on agentic and reasoning benchmarks show that AEvo outperforms five evolution baselines, achieving a 26 relative improvement over the strongest baseline. Across three open-ended optimization tasks, AEvo further outperforms four evolution baselines and achieves state-of-the-art performance under the same iteration budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AEvo's meta-editing layer on accumulated evolution context is a clean unification of procedure and agent approaches, but the 26% gains rest on details that need checking for fair baselines and stability.

read the letter

The main contribution here is framing agentic evolution as an interactive environment where a meta-agent edits the procedure or context that drives future steps, rather than generating candidates directly. This treats the growing pile of traces, feedback, and failures as actionable state instead of just history. It aims to handle both rigid hand-designed procedures and flexible but drifting general agents under one interface, which is a straightforward way to make long-horizon search more steerable. The paper contrasts this with prior work and shows AEvo beating five evolution baselines by 26% relative on agentic and reasoning benchmarks, plus state-of-the-art on three open-ended optimization tasks under fixed iteration budgets. If the experiments use reasonable controls and the meta-edits actually deliver without extra instability, the practical payoff for agent workflows looks real. The idea itself is new enough relative to fixed or fully general baselines. The soft spots sit in the experimental side. The abstract gives no implementation specifics, exact baseline descriptions, or ablation on whether the meta-agent introduces its own drift or requires heavy tuning. The central assumption that editing context reliably steers without new failure modes needs direct evidence from the runs. Without seeing those controls, it's hard to know how much of the gain comes from the framework versus better prompting or more compute. This is aimed at people building or optimizing agentic search systems in AI. A reader working on long-horizon agent tasks would pick up the interface idea quickly and could test it themselves. It deserves a serious referee to verify the methods, check reproducibility, and see if the gains hold under tighter scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces AEvo, a meta-editing framework for agentic evolution. It formulates evolution as an interactive environment with accumulated context as process-level state; a meta-agent then edits the underlying procedure or agent context (rather than directly proposing candidates) to steer future search. The central empirical claim is that AEvo outperforms five evolution baselines by 26% relative improvement on agentic and reasoning benchmarks and achieves state-of-the-art results on three open-ended optimization tasks under a fixed iteration budget.

Significance. If the empirical results and stability claims hold, AEvo would offer a concrete mechanism for turning rich accumulated traces into actionable edits, addressing a genuine gap between rigid modular procedures and drift-prone general agents. This could meaningfully improve long-horizon program synthesis and open-ended optimization.

major comments (2)

[Empirical Evaluations] Empirical Evaluations section: the headline claim of a 26 relative improvement over the strongest baseline is load-bearing for the contribution, yet the manuscript provides no table or text specifying the exact five baselines, the evaluation metric, number of independent runs, variance, or statistical test used to establish significance.
[AEvo Framework] AEvo Framework description (likely §3): the central modeling choice—that meta-edits to procedure or context will reliably steer long-horizon evolution without introducing new drift—is asserted but not supported by any ablation on edit stability, context accumulation limits, or failure modes.

minor comments (2)

[Abstract] The abstract uses '26 relative improvement' without clarifying whether this is a percentage or ratio; consistent terminology should be used throughout.
[AEvo Framework] Notation for the 'process-level state' and the precise interface between meta-agent and evolution procedure should be formalized with a diagram or pseudocode for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address both major comments below and will revise the manuscript to strengthen the empirical reporting and framework analysis.

read point-by-point responses

Referee: [Empirical Evaluations] Empirical Evaluations section: the headline claim of a 26 relative improvement over the strongest baseline is load-bearing for the contribution, yet the manuscript provides no table or text specifying the exact five baselines, the evaluation metric, number of independent runs, variance, or statistical test used to establish significance.

Authors: We agree the current presentation is insufficiently detailed. Section 4 describes the five baselines (EvoPrompt, Reflexion, AgentCoder, Self-Refine, and Tree-of-Thoughts) and uses accuracy/success rate as metrics, but a consolidated table is absent. In the revision we will add a new table reporting: exact baseline names and implementations, evaluation metrics, 5 independent runs with mean and standard deviation, and two-tailed t-test p-values confirming significance of the 26% relative gain over the strongest baseline. revision: yes
Referee: [AEvo Framework] AEvo Framework description (likely §3): the central modeling choice—that meta-edits to procedure or context will reliably steer long-horizon evolution without introducing new drift—is asserted but not supported by any ablation on edit stability, context accumulation limits, or failure modes.

Authors: We acknowledge that explicit ablations on edit stability and context limits are not present. The end-to-end results across agentic and open-ended tasks provide indirect support via consistent gains without performance collapse, yet we agree a dedicated analysis would be valuable. In revision we will add a short subsection discussing observed failure modes (e.g., context overflow after ~20 iterations) and qualitative evidence from our runs that meta-edits did not introduce measurable drift; we cannot run new quantitative ablations within the revision timeline but will include the requested discussion based on existing logs. revision: partial

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on modeling accumulated evolution evidence as a usable process-level state that a meta-agent can edit productively; this is presented as a new formulation without upstream derivation.

axioms (1)

domain assumption Agentic evolution can be usefully modeled as an interactive environment whose state is the accumulated context of candidates, feedback, traces, and failures.
Invoked when formulating the problem as an environment that the meta-agent observes and edits.

invented entities (1)

AEvo meta-editing framework no independent evidence
purpose: To provide a unified interface for editing the evolution procedure or agent context using accumulated state.
Newly introduced construct that enables steering both procedure-based and agent-based evolution.

pith-pipeline@v0.9.0 · 5566 in / 1276 out tokens · 36838 ms · 2026-05-14T17:54:03.627190+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 14 internal anchors

[1]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Claude Code, 2025.https://docs.anthropic.com/en/docs/claude-code/ overview

Anthropic. Claude Code, 2025.https://docs.anthropic.com/en/docs/claude-code/ overview

work page 2025
[3]

Anthropic’s Original Performance Take-Home

Anthropic PBC. Anthropic’s Original Performance Take-Home. https://github.com/ anthropics/original_performance_takehome, January 2026. GitHub repository, com- mit 5452f74. Accessed: 2026-05-06

work page 2026
[4]

An improved example for an autoconvolution inequality

Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality. Experimental Mathematics, pages 1–7, 2026

work page 2026
[5]

Arc- agi-2: A new challenge for frontier ai reasoning systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025

work page arXiv 2025
[6]

Interactcomp: Evaluating search agents with ambiguous queries.arXiv preprint arXiv:2510.24668, 2025

Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, et al. Interactcomp: Evaluating search agents with ambiguous queries.arXiv preprint arXiv:2510.24668, 2025

work page arXiv 2025
[7]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

work page internal anchor Pith review arXiv 2025
[8]

Evolved policy gradients.Advances in Neural Information Processing Systems, 31, 2018

Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly Stadie, Filip Wolski, OpenAI Jonathan Ho, and Pieter Abbeel. Evolved policy gradients.Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[9]

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems.arXiv preprint arXiv:2408.08435, 2024

work page internal anchor Pith review arXiv 2024
[10]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

autoresearch: Ai agents running research on single-gpu nanochat training automatically

Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training automatically. GitHub repository, 2026. Accessed: 2026-05-06

work page 2026
[12]

Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines. 2024

work page 2024
[13]

DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework

Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, and Yuyu Luo. Deepeye-sql: A software- engineering-inspired text-to-sql framework.CoRR, abs/2510.17586, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Alpha-sql: Zero-shot text-to-sql using monte carlo tree search

Boyan Li, Jiayi Zhang, Ju Fan, Yanwei Xu, Chong Chen, Nan Tang, and Yuyu Luo. Alpha-sql: Zero-shot text-to-sql using monte carlo tree search. InICML. OpenReview.net, 2025

work page 2025
[15]

Deepeye: A steerable self-driving data agent system

Boyan Li, Yiran Peng, Yupeng Xie, Sirong Lu, Yizhang Zhu, Xing Mu, Xinyu Liu, and Yuyu Luo. Deepeye: A steerable self-driving data agent system. InCompanion of the 2026 International Conference on Management of Data, SIGMOD Companion ’26, Bengaluru, India,

work page 2026
[16]

doi: 10.1145/3788853.3801612

ACM. doi: 10.1145/3788853.3801612

work page doi:10.1145/3788853.3801612
[17]

Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems.arXiv preprint arXiv:2504.01990, 2025

Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems.arXiv preprint arXiv:2504.01990, 2025. 10

work page arXiv 2025
[18]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Codex, 2025.https://openai.com/index/introducing-codex/

OpenAI. Codex, 2025.https://openai.com/index/introducing-codex/

work page 2025
[21]

OpenCode: The open source AI coding agent, 2025.https://opencode.ai

OpenCode. OpenCode: The open source AI coding agent, 2025.https://opencode.ai

work page 2025
[22]

Packing circles in a square: A review and new results

Ronald Peikert, Diethelm Würtz, Michael Monagan, and Claas de Groot. Packing circles in a square: A review and new results. InSystem Modelling and Optimization: Proceedings of the 15th IFIP Conference Zurich, Switzerland, September 2–6, 1991, pages 45–54. Springer, 2007

work page 1991
[23]

2604.01658 , archivePrefix =

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

work page arXiv 2026
[24]

Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Yongru Chen, Bang Liu, Chenglin Wu, et al. Aorchestra: Automating sub-agent creation for agentic orchestration.arXiv preprint arXiv:2602.03786, 2026

work page arXiv 2026
[25]

Openevolve: an open-source evolutionary coding agent, 2025

Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https://github.com/algorithmicsuperintelligence/openevolve

work page 2025
[26]

Terminal-bench: A benchmark for ai agents in terminal environ- ments, Apr 2025

The Terminal-Bench Team. Terminal-bench: A benchmark for ai agents in terminal environ- ments, Apr 2025. URLhttps://github.com/laude-institute/terminal-bench

work page 2025
[27]

Learning to reinforcement learn

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[28]

Huxley-g\" odel machine: Human-level coding agent development by an approximation of the optimal self-improving machine.arXiv preprint arXiv:2510.21614, 2025

Wenyi Wang, Piotr Piekos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, and Jürgen Schmidhuber. Huxley-g\" odel machine: Human-level coding agent development by an approximation of the optimal self-improving machine.arXiv preprint arXiv:2510.21614, 2025

work page arXiv 2025
[29]

Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026

Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296

work page arXiv 2026
[30]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Self-supervised prompt optimization.arXiv preprint arXiv:2502.06855, 2025

Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Fengwei Teng, Jinhao Tu, Xinbing Liang, Sirui Hong, Chenglin Wu, and Yuyu Luo. Self-supervised prompt optimization.arXiv preprint arXiv:2502.06855, 2025

work page arXiv 2025
[32]

Learning to continually learn via meta-learning agentic memory designs.arXiv preprint arXiv:2602.07755, 2026

Yiming Xiong, Shengran Hu, and Jeff Clune. Learning to continually learn via meta-learning agentic memory designs.arXiv preprint arXiv:2602.07755, 2026

work page arXiv 2026
[33]

Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025

Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, and Min-Ling Zhang. Robustflow: Towards robust agentic workflow generation.arXiv preprint arXiv:2509.21834, 2025

work page arXiv 2025
[34]

Asi-evolve: Ai accelerates ai.arXiv preprint arXiv:2603.29640, 2026

Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, and Pengfei Liu. Asi-evolve: Ai accelerates ai.arXiv preprint arXiv:2603.29640, 2026. 11

work page arXiv 2026
[35]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated soft- ware engineering. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://arxiv.org/abs/2405.15793

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Evaluation-driven Scaling for Scientific Discovery

Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, et al. Evaluation-driven scaling for scientific discovery.arXiv preprint arXiv:2604.19341, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic" differentiation" via text.arXiv preprint arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Learning to discover at test time.arXiv preprint arXiv:2601.16175, 2026

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, et al. Learning to discover at test time. arXiv preprint arXiv:2601.16175, 2026

work page arXiv 2026
[40]

URLhttps://arxiv.org/abs/2512.18746

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchun- shu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems.arXiv preprint arXiv:2512.18746, 2025

work page arXiv 2025
[41]

Darwin G

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025

work page arXiv 2025
[42]

Hyperagents.arXiv preprint arXiv:2603.19461, 2026

Jenny Zhang, Bingchen Zhao, Wannan Yang, Jakob Foerster, Jeff Clune, Minqi Jiang, Sam Devlin, and Tatiana Shavrina. Hyperagents.arXiv preprint arXiv:2603.19461, 2026

work page arXiv 2026
[43]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation.arXiv preprint arXiv:2410.10762, 2024. 12 A Ablation Study Details Table 3: Ablation study on the Kernel optimization task. Full reports the main AEVO Agent setting, while ...

work page internal anchor Pith review arXiv 2024
[44]

Stage the local DB once at session start: ‘cp shared/notes/oer_eval_local_template.db ./.oer_eval.local.db‘

work page
[45]

Run all evals with an explicit local DB: ‘oer-eval eval --program attempts/vN.py --db-path ./.oer_eval.local.db‘

work page
[46]

The meta agent will replay the rows from ‘./.oer_eval.local.db‘ into the workspace ‘.oer_eval.db‘ after your session ends

work page
[47]

Do NOT spend evals re-confirming the readonly issue

work page
[48]

"" session_id = int(sys.argv[1]) ws = Path(__file__).resolve().parent.parent.parent session_dir = ws /

NEVER copy or mutate ‘../../../.oer_eval.db‘ directly from inside the sandbox. ## SESSION_NOTES.md (required on exit) Write ‘SESSION_NOTES.md‘ at your cwd root before finishing. Session goal. # goal for next inner-agent session (session 7) ## Status - Current best: 1774 cycles, score 83.28. - 61 evals remaining in the global quota; this session has MAX_EV...

work page