arxiv: 2606.03895 · v2 · pith:EATBPAAFnew · submitted 2026-06-02 · 💻 cs.OS · cs.AI · cs.CR

Agent libOS: A Runtime Substrate for Capability-Controlled Self-Evolving LLM Agents

Pith reviewed 2026-06-30 11:03 UTC · model grok-4.3

classification 💻 cs.OS · cs.AI · cs.CR
keywords LLM agents · capability-based security · library operating system · self-evolving agents · runtime substrate · agent safety · process abstraction · sandboxing

The pith

Agent libOS separates model-visible agent affordances from resource authority, changing the latter only via explicit audited primitives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agent libOS as a library operating system substrate for LLM agents that accumulate memory, skills, tools, and child processes over time. Its design ensures that these evolving mechanisms do not automatically confer authority over files, shells, memory, or external resources. Authority transfers occur only through runtime primitives that are audited and capability-mediated. A sympathetic reader would care because self-evolving agents otherwise create a direct path from new capabilities to permission escalation in long-running systems.

Core claim

Agent libOS represents each agent as an AgentProcess holding process identity, object memory, message queues, tool tables, loaded skills, JIT tools, child processes, budgets, checkpoints, and explicit capabilities. AgentImage objects and checkpoint-derived images capture boot and reusable state. None of these elements grants filesystem, shell, human, memory, process, checkpoint, image, JSON-RPC, MCP, or PTY authority on its own. The prototype implements the required namespaces, observability, approval queues, and syscall mediation. On 27 versioned deterministic tasks the system completed every task plan, blocked all modeled unauthorized side effects, and recorded a 7.0 percent conservative false-denial rate.
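The shape of this claim can be illustrated with a minimal Python sketch. It is not the paper's implementation (the prototype is described as using Deno/TypeScript for JIT tools); the class, field names, and capability strings such as `fs.read` and `shell.exec` are hypothetical, chosen only to mirror the elements listed above. The point it shows is that loading a skill extends the tool table while the capability set stays unchanged.

```python
from dataclasses import dataclass, field


# Hypothetical field names modeled on the summary's list; not the paper's API.
@dataclass
class AgentProcess:
    pid: int
    object_memory: dict = field(default_factory=dict)
    message_queue: list = field(default_factory=list)
    tool_table: dict = field(default_factory=dict)   # model-visible affordances
    skills: set = field(default_factory=set)
    jit_tools: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    budget: int = 0
    checkpoints: list = field(default_factory=list)
    capabilities: frozenset = frozenset()            # resource authority

    def load_skill(self, name, tools):
        """Extend the action surface; deliberately leaves capabilities untouched."""
        self.skills.add(name)
        self.tool_table.update(tools)


# Illustrative values only: the tool maps to a capability the process does not hold.
proc = AgentProcess(pid=1, capabilities=frozenset({"fs.read"}))
before = proc.capabilities
proc.load_skill("shell-helper", {"run_shell": "shell.exec"})
```

After the call, `run_shell` is visible in the tool table, but `proc.capabilities` is still the original set, so in this sketch a mediating runtime would have no basis for letting the tool reach a shell.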

What carries the argument

The AgentProcess abstraction, which isolates evolving model-visible affordances from resource authority and routes all authority changes through explicit audited runtime primitives.
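One way to picture "explicit audited runtime primitives" is a runtime in which exactly one method changes authority and every decision is logged. The sketch below is an assumption-laden toy in Python, not the paper's runtime: `Runtime`, `register_tool`, `grant`, `syscall`, and the `fs.write` capability name are all invented for illustration.

```python
class Runtime:
    """Toy runtime: tool tables and capabilities are stored separately."""

    def __init__(self):
        self.tool_tables = {}    # pid -> {tool name: required capability}
        self.capabilities = {}   # pid -> set of held capabilities
        self.audit_log = []      # append-only record of authority decisions

    def spawn(self, pid):
        self.tool_tables[pid] = {}
        self.capabilities[pid] = set()

    def register_tool(self, pid, tool, required_cap):
        # Evolves the affordance only; no authority is added here.
        self.tool_tables[pid][tool] = required_cap

    def grant(self, pid, cap, approver):
        # The single path that changes authority, and it always leaves a record.
        self.capabilities[pid].add(cap)
        self.audit_log.append(("grant", pid, cap, approver))

    def syscall(self, pid, tool):
        # Mediation point: the tool runs only if its capability is held.
        required = self.tool_tables[pid][tool]
        allowed = required in self.capabilities[pid]
        self.audit_log.append(("allow" if allowed else "deny", pid, required, tool))
        return allowed


rt = Runtime()
rt.spawn(1)
rt.register_tool(1, "write_file", "fs.write")
first = rt.syscall(1, "write_file")            # tool exists, authority does not
rt.grant(1, "fs.write", approver="operator")
second = rt.syscall(1, "write_file")           # same tool, now authorized
```

In this toy, the first call is denied and the second is allowed, and the audit log records the deny, the grant, and the allow in that order. Keeping the grant as the only mutation point is what makes the log a complete account of authority changes in the sketch.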

If this is right

  • Skills, JIT tools, and checkpoints extend the action surface without granting resource authority by themselves.
  • Checkpoint restore, fork, and commit operations remain under capability control.
  • Human approval queues and budgets integrate directly into the same runtime substrate.
  • Process-local namespaces and observability allow deterministic verification of safety properties.
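The checkpoint bullet can be sketched as follows, again in hypothetical Python. The source states only that restore, fork, and commit remain under capability control; the specific policy shown here (a `checkpoint.fork` capability, images that carry state but no authority, and delegation limited to what the parent already holds) is an assumption made for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Proc:
    pid: int
    memory: dict = field(default_factory=dict)
    tool_table: dict = field(default_factory=dict)
    capabilities: set = field(default_factory=set)


def checkpoint(proc):
    """Snapshot internal state only; authority is intentionally not captured."""
    return {"memory": copy.deepcopy(proc.memory), "tool_table": dict(proc.tool_table)}


def fork_from(parent, image, child_pid, delegate=()):
    """Fork a child from an image. Hypothetical policy, not the paper's."""
    if "checkpoint.fork" not in parent.capabilities:
        raise PermissionError("checkpoint.fork capability required")
    granted = set(delegate)
    if not granted <= parent.capabilities:
        raise PermissionError("cannot delegate authority the parent lacks")
    return Proc(child_pid, copy.deepcopy(image["memory"]),
                dict(image["tool_table"]), granted)


parent = Proc(1, {"notes": ["a"]}, {"read_file": "fs.read"},
              {"fs.read", "checkpoint.fork"})
image = checkpoint(parent)
child = fork_from(parent, image, child_pid=2)   # state copied, no authority

try:
    fork_from(Proc(3), image, child_pid=4)      # caller lacks checkpoint.fork
    unprivileged_fork_blocked = False
except PermissionError:
    unprivileged_fork_blocked = True
```

Here the child sees the same tool table and memory as the parent but starts with an empty capability set, and a process without the fork capability cannot create a child at all.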

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of affordance evolution from authority change could be applied to non-LLM autonomous systems that synthesize new behaviors.
  • Extending the side-effect model beyond the fixed benchmark set would be required before claiming protection in open, non-deterministic environments.
  • Reusable AgentImage objects carrying state but not authority suggest a path toward safe migration of long-running agent instances across hosts.

Load-bearing premise

The set of modeled unauthorized side effects in the benchmark accurately captures the real security threats that would arise from self-evolving agents in open environments.

What would settle it

An execution trace in which an agent using Agent libOS performs an unmodeled unauthorized side effect after evolving its tool table or skills.

Figures

Figures reproduced from arXiv: 2606.03895 by Yingqi Zhang.

Figure 1: Agent libOS separates model-facing action schemas from primitive-level authority.
Figure 1: The Agent libOS layer model. Model-visible self...
read the original abstract

Large language model (LLM) agents are becoming long-running software actors rather than fixed tool users. They accumulate memory, activate skills, synthesize tools, fork children, attach remote resources, and commit checkpoints into reusable execution images. These mechanisms improve adaptability, but also create a systems-security failure mode: if exposing an action also grants the authority needed to perform it, self-evolution becomes a permission-escalation path. This paper presents Agent libOS, an agent-native library-OS substrate for capability-controlled self-evolving agents. Its central invariant is that model-visible affordances may evolve while resource authority changes only through explicit, audited runtime primitives. Agent libOS represents an agent as an AgentProcess with process identity, process-local Object Memory, message queues, a tool table, loaded Skills, process-local Deno/TypeScript JIT tools, child processes, budgets, checkpoints, and explicit capabilities. AgentImage objects define boot-time prompt and tool-table state; Skills and JIT tools extend the action surface; checkpoint-derived images make internal state reusable. None of these mechanisms grants filesystem, shell, human, memory, process, checkpoint, image, JSON-RPC, MCP, or PTY authority by itself. The prototype implements process-local namespaces, persistent runtime state, LLM-call observability, human approval queues, budgets, syscall-mediated JIT tools, trusted Runtime Modules, Object-bound PTY sessions, checkpoint restore/fork/commit, JSON-RPC and MCP providers, and a deterministic runtime-safety benchmark. On 27 versioned deterministic tasks, it completed the task plans while preventing all modeled unauthorized side effects, with a 7.0% conservative false-denial rate. Simple wrapper and sandbox baselines preserved task completion but failed most safety checks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Agent libOS, a library operating system designed as a runtime substrate for LLM agents that self-evolve through mechanisms like tool synthesis, child forking, JIT compilation, and checkpointing. The central claim is that the system enforces an invariant where model-visible affordances can evolve but resource authority changes only via explicit, audited runtime primitives. The prototype is evaluated on 27 versioned deterministic tasks, completing task plans while preventing all modeled unauthorized side effects with a 7.0% conservative false-denial rate, while simple wrapper and sandbox baselines fail most safety checks.

Significance. If the evaluation is sound, this work offers a significant systems contribution by providing a capability-based runtime for secure self-evolving agents, addressing a key security challenge in long-running LLM agents. The explicit design of primitives that do not grant authority by themselves is a strength, and the benchmark provides initial empirical support for the approach.

major comments (2)
  1. [Abstract] The evaluation reports success on 27 tasks but provides no definitions of the tasks, no details on the threat model for 'modeled unauthorized side effects', and no indication that the tasks exercise self-evolution primitives (e.g., tool synthesis or checkpointing). This leaves open whether the benchmark validates the invariant under dynamic evolution or only in static cases.
  2. [Abstract] The false-denial rate of 7.0% is described as 'conservative', but with no statistical analysis, variance, or per-task breakdown reported, it is difficult to assess the reliability of the safety claims relative to the baselines.
minor comments (1)
  1. The abstract mentions 'versioned deterministic tasks' but does not clarify what versioning entails or how determinism is ensured across the runtime mechanisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional context is warranted and will revise the abstract and evaluation section accordingly. Point-by-point responses to the major comments are below.

read point-by-point responses
  1. Referee: [Abstract] The evaluation reports success on 27 tasks but provides no definitions of the tasks, no details on the threat model for 'modeled unauthorized side effects', and no indication that the tasks exercise self-evolution primitives (e.g., tool synthesis or checkpointing). This leaves open whether the benchmark validates the invariant under dynamic evolution or only in static cases.

    Authors: We agree the abstract is too terse on these points. The full manuscript defines the 27 versioned deterministic tasks in Section 4.1 and the threat model (unauthorized access to filesystem, shell, human, memory, process, checkpoint, image, JSON-RPC, MCP, or PTY resources without explicit capabilities) in Section 3.2. The tasks explicitly exercise self-evolution primitives including tool synthesis, child forking, JIT compilation, and checkpointing, as stated in the benchmark description. We will revise the abstract to include a concise reference to the task definitions, threat model, and the fact that the benchmark covers dynamic evolution scenarios, thereby confirming that the invariant is tested under self-evolution rather than only static cases. revision: yes

  2. Referee: [Abstract] The false-denial rate of 7.0% is described as 'conservative', but with no statistical analysis, variance, or per-task breakdown reported, it is difficult to assess the reliability of the safety claims relative to the baselines.

    Authors: The benchmark consists of 27 fixed deterministic tasks, so no run-to-run variance exists and standard statistical measures such as confidence intervals are not applicable. The 7.0% aggregate false-denial rate counts every denial of a safe action as a false positive (hence 'conservative'). We will add a per-task breakdown of false denials to the evaluation section and insert a brief summary sentence in the abstract. This will improve comparability with the wrapper and sandbox baselines. revision: yes
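The conservative counting rule described in this response can be written down as a small Python function. This is a sketch under assumptions: the response does not state the denominator, so the code assumes the rate is taken over all safe actions attempted, and the sample decisions are invented for illustration and are not the paper's data.

```python
def false_denial_rate(decisions):
    """Fraction of safe actions that were denied.

    `decisions` is an iterable of (action_is_safe, was_denied) pairs.
    Every denial of a safe action counts as a false denial, with no
    exceptions for borderline cases (the conservative reading).
    """
    denied_flags = [denied for is_safe, denied in decisions if is_safe]
    if not denied_flags:
        return 0.0
    return sum(denied_flags) / len(denied_flags)


# Illustrative decisions only: four safe actions (one denied) and one unsafe
# action that was correctly denied and therefore does not affect the rate.
decisions = [(True, False), (True, True), (True, False), (True, False), (False, True)]
rate = false_denial_rate(decisions)
```

With these made-up inputs, one of four safe actions is denied, giving a rate of 0.25. A per-task breakdown of the kind the authors promise would amount to calling the same function once per task.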

Circularity Check

0 steps flagged

No circularity; empirical benchmark stands on its own

full rationale

The paper's central claim rests on an empirical evaluation: completion of 27 versioned deterministic tasks while blocking all modeled unauthorized side effects (with a reported 7.0% false-denial rate) and comparison against wrapper/sandbox baselines. No equations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or ansatzes are invoked in the provided text. The design invariant (authority changes only via explicit audited primitives) is presented as a stated architectural property rather than derived from the benchmark results. The evaluation is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, domain axioms, or invented scientific entities; the contribution is a software architecture whose correctness rests on unstated implementation details and threat-model assumptions.

pith-pipeline@v0.9.1-grok · 5844 in / 1109 out tokens · 56356 ms · 2026-06-30T11:03:48.741027+00:00 · methodology
