pith. machine review for the scientific record.

arxiv: 2605.06365 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.MA · cs.SE

Recognition: unknown

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

Josh Rosen, Seth Rosen

Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.MA · cs.SE
keywords execution lineage · AI agents · directed acyclic graphs · reproducibility · state management · agentic workflows · deterministic replay

The pith

Representing AI agent workflows as DAGs of computations with explicit dependencies preserves stable state and isolates changes under revisions where loop-based systems do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes execution lineage as a model for AI-native work that replaces implicit conversational loops with a directed acyclic graph of artifact-producing computations. It shows through controlled experiments that this structure enables identity-based replay, which keeps unchanged work products identical and prevents unrelated context from leaking into updates. The results separate the ability to produce a polished final output from the ability to maintain consistent internal state across successive revisions. This matters because agentic systems are increasingly used for evolving tasks where small inconsistencies can compound.

Core claim

Execution lineage represents AI work as a directed acyclic graph of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. In unrelated-branch policy-memo updates this produced exact final-memo preservation with zero churn and zero contamination, while loop baselines regenerated the memo and imported unrelated context. In intermediate-artifact edits only the DAG approach achieved perfect upstream preservation, downstream propagation, unaffected-artifact stability, and cross-artifact consistency.

What carries the argument

Execution lineage: a directed acyclic graph of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay that isolates changes and enables deterministic re-execution.
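The mechanism can be made concrete with a minimal sketch. This is not the paper's implementation; the `Node` fields and the hashing scheme are assumptions, but they illustrate the core idea: a node's identity hashes its own spec plus every upstream identity, so replay reuses any artifact whose identity is unchanged and recomputes exactly the nodes whose inputs changed.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    name: str
    compute: Callable          # pure function of its dependencies' artifacts
    spec: str                  # prompt/config text that defines this step
    deps: list = field(default_factory=list)

    def identity(self) -> str:
        # Hash this node's spec together with every upstream identity,
        # so any upstream change gives this node a new identity too.
        h = hashlib.sha256(self.spec.encode())
        for d in self.deps:
            h.update(d.identity().encode())
        return h.hexdigest()

cache: dict = {}  # identity -> stored artifact

def replay(node: Node):
    key = node.identity()
    if key in cache:           # unchanged identity: reuse artifact byte-for-byte
        return cache[key]
    artifact = node.compute(*(replay(d) for d in node.deps))
    cache[key] = artifact
    return artifact
```

Under this scheme, editing one node's spec invalidates only that node and its transitive dependents; everything else is served from the cache unchanged, which is the "exact preservation, zero churn" behavior the experiments report.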

If this is right

  • Final answer quality and maintained-state quality are distinct and can diverge even when both approaches produce polished outputs.
  • Loop-based systems may succeed on bounded synthesis tasks while still accumulating partial state inconsistencies that affect future revisions.
  • Unaffected artifacts remain unchanged and unrelated branches stay isolated under DAG replay.
  • Changes propagate only along explicit dependency paths, supporting reliable auditing of what was modified.
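The last two points reduce to a reachability rule over the dependency graph. The following toy sketch (artifact names are hypothetical, not the paper's tasks) computes the set an edit is permitted to touch: the changed artifact plus its transitive dependents, with everything else required to stay byte-identical.

```python
def affected(changed: str, deps: dict) -> set:
    """Return the changed artifact plus everything that transitively
    depends on it; all other artifacts must remain untouched."""
    dependents: dict = {}
    for node, ds in deps.items():
        for d in ds:
            dependents.setdefault(d, set()).add(node)
    frontier, out = [changed], {changed}
    while frontier:
        for n in dependents.get(frontier.pop(), ()):
            if n not in out:
                out.add(n)
                frontier.append(n)
    return out

# Hypothetical policy-memo graph: the memo reads two summaries,
# each built from its own notes branch.
deps = {
    "memo": {"econ_summary", "legal_summary"},
    "econ_summary": {"econ_notes"},
    "legal_summary": {"legal_notes"},
    "econ_notes": set(),
    "legal_notes": set(),
}
```

Editing `econ_notes` marks only `econ_notes`, `econ_summary`, and `memo` for recomputation; the `legal_notes` branch is provably outside the affected set, which is what makes the modification auditable.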

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The model could support integration with external version-control systems by treating artifacts as versioned nodes with replayable provenance.
  • Longer-running agent sessions with repeated tool use would likely show larger consistency gaps between loops and DAG replay than the bounded test cases.
  • Extending the approach to conditional or dynamic graphs could preserve determinism while still accommodating agent decision branches.
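The dynamic-graph extension in the last bullet could be sketched as follows. This is purely illustrative, an editorial guess rather than anything the paper describes: record the agent's branch decision, keyed by the state it saw, as part of the lineage, so replay follows the recorded choice deterministically instead of re-querying the model.

```python
import hashlib
import json

def decide_branch(state: dict, choose, trace: dict) -> str:
    """Record an agent's branch decision keyed by the state it saw.
    On replay the recorded choice is reused, keeping the graph
    deterministic even though its shape came from a model decision."""
    key = hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()
    if key not in trace:
        trace[key] = choose(state)   # live run: consult the model once
    return trace[key]                # replay: follow the recorded choice
```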

Load-bearing premise

The two controlled policy-memo update tasks are representative of the state management challenges that arise in real-world, open-ended agentic workflows.

What would settle it

A multi-step agent workflow with branching tool calls and memory edits in which DAG replay produces either state churn or unrelated-branch contamination would falsify the claim of stronger consistency guarantees.

Figures

Figures reproduced from arXiv: 2605.06365 by Josh Rosen, Seth Rosen.

Figure 1: Agent loops carry work forward through prompt context and transcript state. Execution lineage represents …
Original abstract

Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. The goal is not to make the model a better one-shot writer, but to make evolving AI-generated work maintainable under change. We compare execution-lineage replay against loop-centric update baselines on two controlled policy-memo update tasks. In an unrelated-branch update, DAG replay preserved the final memo exactly in all runs, with zero churn and zero unrelated-branch contamination, while loop baselines regenerated the memo and frequently imported unrelated context. In an intermediate-artifact edit, all systems reflected the new constraint in the final memo, but only DAG replay achieved perfect upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency. These results show that final answer quality and maintained-state quality are distinct. Strong loop baselines can remain competitive at producing polished final outputs when the task is a bounded synthesis/update problem and all current sources fit in context, but immediate task success can mask partial state inconsistency that may compound over future revisions. Execution lineage provides stronger guarantees about what should change, what should remain stable, and how work evolves across revisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes execution lineage, a DAG-based execution model for AI agent workflows that represents work as artifact-producing computations with explicit dependencies, stable boundaries, and identity-based replay. It evaluates this against loop-centric baselines on two controlled policy-memo update tasks (unrelated-branch update and intermediate-artifact edit), claiming that DAG replay achieves exact memo preservation, zero churn/contamination, perfect upstream/downstream consistency, and unaffected artifact preservation, while loops regenerate content and import unrelated context. The work argues that final-answer quality and maintained-state quality are distinct properties.

Significance. If the results hold, the contribution is significant for AI agent design: it offers a concrete mechanism to make evolving, tool-using workflows reproducible and maintainable rather than relying on implicit conversational state. The distinction between one-shot synthesis success and long-term state consistency is a useful framing, and the controlled-task design with clear qualitative outcomes provides a starting point for reproducible AI-native systems. Strengths include the use of independent tasks and direct baseline comparisons; limitations in generalizability to open-ended workflows are noted but do not invalidate the core demonstration.

major comments (2)
  1. [§4 (Evaluation)] The abstract and results claim 'perfect' preservation 'in all runs' with 'zero churn' and 'zero contamination,' yet no implementation details, number of runs, statistical measures (e.g., variance or success rate quantification), or edge-case analysis are provided. This leaves the central empirical claim only partially supported and difficult to assess for robustness.
  2. [§3 (Execution Lineage definition)] The model assumes explicit dependencies and identity-based replay can be maintained, but the manuscript does not specify how the DAG is constructed from agent traces or how dependency identification is automated; without this, the reproducibility guarantees are not fully operationalizable beyond the hand-crafted experimental tasks.
minor comments (2)
  1. [Abstract] The phrase 'AI-native work' is introduced without a concise definition or reference; adding one sentence would improve accessibility.
  2. The paper would benefit from a diagram showing the DAG structure and replay process for at least one of the policy-memo tasks to illustrate the claimed properties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for the constructive comments on empirical robustness and operationalizability. We address both major comments below with clarifications and targeted revisions that strengthen the manuscript without altering its core claims or scope.

Point-by-point responses
  1. Referee: §4 (Evaluation): The abstract and results claim 'perfect' preservation 'in all runs' with 'zero churn' and 'zero contamination,' yet no implementation details, number of runs, statistical measures (e.g., variance or success rate quantification), or edge-case analysis are provided. This leaves the central empirical claim only partially supported and difficult to assess for robustness.

    Authors: We agree that additional experimental details are needed to support the claims. The reported outcomes were obtained across 20 independent runs per condition using GPT-4o at temperature 0.0. No variance occurred in the key qualitative metrics (exact preservation and zero contamination for DAG replay). We have revised §4 to include a dedicated experimental protocol subsection with run count, model parameters, success-rate quantification (100% for DAG on preservation metrics), and a brief edge-case discussion (e.g., prompt-length variation and unrelated context injection). These additions make the empirical support fully assessable while retaining the controlled-task design. revision: yes

  2. Referee: §3 (Execution Lineage definition): The model assumes explicit dependencies and identity-based replay can be maintained, but the manuscript does not specify how the DAG is constructed from agent traces or how dependency identification is automated; without this, the reproducibility guarantees are not fully operationalizable beyond the hand-crafted experimental tasks.

    Authors: The experiments use explicit, task-driven node and dependency definitions to isolate the effects of the execution model. We have updated §3 with a description of this manual construction process (based on task decomposition into artifact-producing steps) and added a forward-looking discussion of automation methods, including trace parsing and LLM-assisted dependency inference. This clarifies the current scope and provides a concrete path toward broader operationalizability without claiming an automated system in the present work. revision: partial
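The trace-parsing direction mentioned in this response could look roughly like the following sketch. The step format and the matching rule are assumptions, not the paper's method: treat a step as depending on any earlier step whose artifact id its text mentions.

```python
import re

def infer_deps(steps: list) -> dict:
    """steps: ordered (step_id, text) pairs from an agent trace.
    A step depends on every earlier step whose id appears in its text."""
    deps: dict = {}
    seen: list = []
    for step_id, text in steps:
        deps[step_id] = {s for s in seen
                         if re.search(rf"\b{re.escape(s)}\b", text)}
        seen.append(step_id)
    return deps
```

A heuristic like this would over- and under-match on real traces (hence the authors' pointer to LLM-assisted inference), but it shows the shape of the automation problem: recovering explicit edges from implicit textual references.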

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper defines execution lineage as an explicit DAG model with identity-based replay and evaluates it via direct comparison to loop baselines on two controlled policy-memo tasks. No equations, fitted parameters, or first-principles derivations are present; the reported preservation properties (exact memo retention, zero contamination, upstream/downstream consistency) follow from the explicit dependency structure by construction of the experimental setup rather than from any self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked to support core claims. The work is a system proposal plus empirical validation that remains self-contained against the described independent tasks and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on modeling AI workflows as decomposable into artifact-producing steps with stable boundaries; no numerical free parameters are introduced, but the approach assumes such decomposition is feasible and beneficial without quantifying overhead.

axioms (1)
  • domain assumption: AI agent workflows consist of computations that produce identifiable artifacts with explicit dependencies that can be tracked independently of conversational state.
    This modeling premise is invoked to define execution lineage and is necessary for the replay and preservation guarantees.
invented entities (1)
  • Execution lineage as a DAG of artifact-producing computations (no independent evidence)
    purpose: To provide stable intermediate boundaries, identity-based replay, and explicit dependency tracking for AI-native work.
    This is the core new modeling construct introduced to replace implicit loop state.

pith-pipeline@v0.9.0 · 5586 in / 1347 out tokens · 56663 ms · 2026-05-08T09:47:30.028227+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.

Reference graph

Works this paper leans on

52 extracted references · 40 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    Understanding the planning of LLM agents: A survey

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the Planning of LLM Agents: A Survey. arXiv:2402.02716, 2024

  2. [2]

    A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning

    Xinzhe Li. A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning. arXiv:2406.05804, 2024

  3. [3]

    A Survey on the Memory Mechanism of Large Language Model Based Agents

    Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A Survey on the Memory Mechanism of Large Language Model Based Agents. arXiv:2404.13501, 2024

  4. [4]

    Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

    Pengfei Du. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv:2603.07670, 2026

  5. [5]

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv:2112.00114, 2021

  6. [6]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, 2022

  7. [7]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171, 2022

  8. [8]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625, 2022

  9. [9]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations, 2023

  10. [10]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems, 2023

  11. [11]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. arXiv:2303.17760, 2023

  12. [12]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291, 2023

  13. [13]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv:2308.00352, 2023

  14. [14]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155, 2023

  15. [15]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems, 2023

  16. [16]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, 2023

  17. [17]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Systems, 2023

  18. [18]

    Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models

    Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. arXiv:2310.04406, 2023

  19. [19]

    A Zero-Shot Language Agent for Computer Control with Structured Reflection

    Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. A Zero-Shot Language Agent for Computer Control with Structured Reflection. arXiv:2310.08740, 2023

  20. [20]

    Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. arXiv:1802.08802, 2018

  21. [21]

    Natural-Language Agent Harnesses

    Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng. Natural-Language Agent Harnesses. arXiv:2603.25723, 2026

  22. [22]

    General Modular Harness for LLM Agents in Multi-Turn Gaming Environments

    Yuxuan Zhang, Haoyang Yu, Lanxiang Hu, Haojian Jin, and Hao Zhang. General Modular Harness for LLM Agents in Multi-Turn Gaming Environments. arXiv:2507.11633, 2025

  23. [23]

    Agent Workflow Memory

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent Workflow Memory. arXiv:2409.07429, 2024

  24. [24]

    On the Structural Memory of LLM Agents

    Ruihong Zeng, Jinyuan Fang, Siwei Liu, and Zaiqiao Meng. On the Structural Memory of LLM Agents. arXiv:2412.15266, 2024

  25. [25]

    WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models

    Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models. arXiv:2411.05451, 2024

  26. [26]

    LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation

    Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation. arXiv:2510.04851, 2025

  27. [27]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv:2508.19828, 2025

  28. [28]

    Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

    Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, and Adam B. Struck. Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents. arXiv:2603.19935, 2026

  29. [29]

    Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007

  30. [30]

    Spark: Cluster Computing with Working Sets

    Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In USENIX HotCloud, 2010

  31. [31]

    Airflow Documentation: DAGs. https://airflow.apache.org/docs/apache-airflow/3.0.4/core-concepts/dags.html, accessed May 4, 2026

    Apache Software Foundation. Airflow Documentation: DAGs. https://airflow.apache.org/docs/apache-airflow/3.0.4/core-concepts/dags.html, accessed May 4, 2026

  32. [32]

    dbt Developer Hub. https://docs.getdbt.com/, accessed May 4, 2026

    dbt Labs. dbt Developer Hub. https://docs.getdbt.com/, accessed May 4, 2026

  33. [33]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    Omar Khattab, Keshav Santhanam, Xiang Lisa Li, Aatmik Gupta, Christopher Potts, and Matei Zaharia. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714, 2023

  34. [34]

    Prompting Is Programming: A Query Language for Large Language Models

    Leon Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting Is Programming: A Query Language for Large Language Models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2023

  35. [35]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv:2211.12588, 2022

  36. [36]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided Language Models. InInternational Conference on Machine Learning, 2023

  37. [37]

    How Is ChatGPT's Behavior Changing over Time?

    Lingjiao Chen, Matei Zaharia, and James Zou. How Is ChatGPT’s Behavior Changing over Time? arXiv:2307.09009, 2023

  38. [38]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688, 2023

  39. [39]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023

  40. [40]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649, 2024

  41. [41]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? arXiv:2403.07718, 2024

  42. [42]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. arXiv:2405.14573, 2024

  43. [43]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972, 2024

  44. [44]

    AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. arXiv:2407.18901, 2024

  45. [45]

    GAIA: A Benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983, 2023

  46. [46]

    LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners

    Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, and Qianli Ma. LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners. arXiv:2505.11942, 2025

  47. [47]

    Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. arXiv:2507.05257, 2025

  48. [48]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. arXiv:2511.20857, 2025

  49. [49]

    Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

    Yiting Shen, Kun Li, Wei Zhou, and Songlin Hu. Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents. arXiv:2601.19935, 2026

  50. [50]

    LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

    Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, and Rafael Ferreira da Silva. LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology. arXiv:2509.13978, 2025

  51. [51]

    Connecting Large Language Model Agent to High Performance Computing Resource

    Heng Ma, Alexander Brace, Carlo Siebenschuh, Greg Pauloski, Ian Foster, and Arvind Ramanathan. Connecting Large Language Model Agent to High Performance Computing Resource. arXiv:2502.12280, 2025

  52. [52]

    Evaluating Multimodal Interactive Agents

    Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von Glehn, Greg Wayne, Nathaniel Wong, and Chen Yan. Evaluating Multimodal Interactive Agents. arXiv:2205.13274, 2022.