From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Pith reviewed 2026-05-08 09:47 UTC · model grok-4.3
The pith
Representing AI agent workflows as DAGs of computations with explicit dependencies preserves stable state and isolates changes under revision, where loop-based systems do not.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Execution lineage represents AI work as a directed acyclic graph of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. In unrelated-branch policy-memo updates this produced exact final-memo preservation with zero churn and zero contamination, while loop baselines regenerated the memo and imported unrelated context. In intermediate-artifact edits only the DAG approach achieved perfect upstream preservation, downstream propagation, unaffected-artifact stability, and cross-artifact consistency.
What carries the argument
Execution lineage: a directed acyclic graph of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay that isolates changes and enables deterministic re-execution.
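Read operationally, the definition above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the node names, the `spec` field, and the hashing scheme are assumptions standing in for whatever the authors actually use for identity.

```python
import hashlib

# Illustrative sketch of execution lineage (not the paper's code): each
# node is an artifact-producing computation with explicit dependencies;
# a node's identity hashes its own spec together with the identities of
# its inputs, and replay reuses any cached artifact whose identity is
# unchanged, so re-execution is deterministic and minimal.

class Node:
    def __init__(self, spec, deps, fn):
        self.spec, self.deps, self.fn = spec, deps, fn

def replay(graph, cache):
    """Run every node, reusing cache entries keyed by identity."""
    ids, out = {}, {}
    def run(name):
        if name in out:
            return out[name]
        node = graph[name]
        dep_vals = [run(d) for d in node.deps]
        node_id = hashlib.sha256(
            (node.spec + "|" + "|".join(ids[d] for d in node.deps)).encode()
        ).hexdigest()
        ids[name] = node_id
        if node_id not in cache:        # identity changed: recompute
            cache[node_id] = node.fn(*dep_vals)
        out[name] = cache[node_id]      # identity unchanged: pure replay
        return out[name]
    for name in graph:
        run(name)
    return ids, out

# Intermediate-artifact edit: change one node's spec and replay.
graph = {
    "sources":  Node("collect sources", [], lambda: "S"),
    "analysis": Node("analyze", ["sources"], lambda s: s + ">A"),
    "memo":     Node("draft memo", ["analysis"], lambda a: a + ">M"),
}
cache = {}
ids_before, _ = replay(graph, cache)
graph["analysis"].spec = "analyze with new constraint"
ids_after, _ = replay(graph, cache)
```

Under this sketch, editing the intermediate node invalidates exactly its own identity and everything downstream of it, while upstream identities (and thus cached artifacts) are untouched.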
If this is right
- Final answer quality and maintained-state quality are distinct and can diverge even when both approaches produce polished outputs.
- Loop-based systems may succeed on bounded synthesis tasks while still accumulating partial state inconsistencies that affect future revisions.
- Unaffected artifacts remain unchanged and unrelated branches stay isolated under DAG replay.
- Changes propagate only along explicit dependency paths, supporting reliable auditing of what was modified.
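The last two bullets are, in this model, properties of the dependency structure itself rather than of model behavior. A minimal self-contained check of the unrelated-branch scenario (node names and the hash-based identity scheme are illustrative assumptions, not the paper's setup):

```python
import hashlib

# Unrelated-branch scenario: the memo depends only on the analysis
# branch, so editing the appendix branch cannot change the memo's
# identity, and identity-based replay leaves the memo intact.

def identities(graph):
    """graph: name -> (spec, [deps]). Returns name -> identity hash."""
    ids = {}
    def ident(name):
        if name not in ids:
            spec, deps = graph[name]
            ids[name] = hashlib.sha256(
                (spec + "|" + "|".join(ident(d) for d in deps)).encode()
            ).hexdigest()
        return ids[name]
    for name in graph:
        ident(name)
    return ids

graph = {
    "sources":  ("collect sources", []),
    "analysis": ("analyze", ["sources"]),
    "memo":     ("draft memo", ["analysis"]),
    "appendix": ("draft appendix", ["sources"]),   # unrelated branch
}
before = identities(graph)
graph["appendix"] = ("draft appendix v2", ["sources"])  # edit the branch
after = identities(graph)
```

Because the memo's identity is a function only of its transitive dependencies, the appendix edit cannot reach it; a loop-based system holding all state in one context window has no comparable structural guarantee.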
Where Pith is reading between the lines
- The model could support integration with external version-control systems by treating artifacts as versioned nodes with replayable provenance.
- Longer-running agent sessions with repeated tool use would likely show larger consistency gaps between loops and DAG replay than the bounded test cases.
- Extending the approach to conditional or dynamic graphs could preserve determinism while still accommodating agent decision branches.
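On the last point, one plausible way to keep determinism in a conditional graph is to record the agent's branch decision as an artifact in its own right, so replay reads the stored decision instead of re-invoking the model. This is speculation consistent with the reading above, not something the paper specifies; every name here is hypothetical.

```python
# Hypothetical sketch: a branch decision stored as an artifact.
# On first execution the "agent" chooses a branch; on replay the
# recorded choice is reused, so the graph stays deterministic even
# though it contains a decision point.

def decide(analysis, recorded=None):
    if recorded is not None:
        return recorded                        # replay: reuse stored artifact
    return "escalate" if "risk" in analysis else "summarize"

first = decide("risk flagged in section 2")    # live run records the choice
replayed = decide("risk flagged in section 2", recorded=first)
```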
Load-bearing premise
The two controlled policy-memo update tasks are representative of the state management challenges that arise in real-world, open-ended agentic workflows.
What would settle it
A multi-step agent workflow with branching tool calls and memory edits in which DAG replay produces either state churn or unrelated-branch contamination would falsify the claim of stronger consistency guarantees.
Original abstract
Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic graph (DAG) of artifact-producing computations with explicit dependencies, stable intermediate boundaries, and identity-based replay. The goal is not to make the model a better one-shot writer, but to make evolving AI-generated work maintainable under change. We compare execution-lineage replay against loop-centric update baselines on two controlled policy-memo update tasks. In an unrelated-branch update, DAG replay preserved the final memo exactly in all runs, with zero churn and zero unrelated-branch contamination, while loop baselines regenerated the memo and frequently imported unrelated context. In an intermediate-artifact edit, all systems reflected the new constraint in the final memo, but only DAG replay achieved perfect upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency. These results show that final answer quality and maintained-state quality are distinct. Strong loop baselines can remain competitive at producing polished final outputs when the task is a bounded synthesis/update problem and all current sources fit in context, but immediate task success can mask partial state inconsistency that may compound over future revisions. Execution lineage provides stronger guarantees about what should change, what should remain stable, and how work evolves across revisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes execution lineage, a DAG-based execution model for AI agent workflows that represents work as artifact-producing computations with explicit dependencies, stable boundaries, and identity-based replay. It evaluates this against loop-centric baselines on two controlled policy-memo update tasks (unrelated-branch update and intermediate-artifact edit), claiming that DAG replay achieves exact memo preservation, zero churn/contamination, perfect upstream/downstream consistency, and unaffected artifact preservation, while loops regenerate content and import unrelated context. The work argues that final-answer quality and maintained-state quality are distinct properties.
Significance. If the results hold, the contribution is significant for AI agent design: it offers a concrete mechanism to make evolving, tool-using workflows reproducible and maintainable rather than relying on implicit conversational state. The distinction between one-shot synthesis success and long-term state consistency is a useful framing, and the controlled-task design with clear qualitative outcomes provides a starting point for reproducible AI-native systems. Strengths include the use of independent tasks and direct baseline comparisons; limitations in generalizability to open-ended workflows are noted but do not invalidate the core demonstration.
Major comments (2)
- §4 (Evaluation): The abstract and results claim 'perfect' preservation 'in all runs' with 'zero churn' and 'zero contamination,' yet no implementation details, number of runs, statistical measures (e.g., variance or success rate quantification), or edge-case analysis are provided. This leaves the central empirical claim only partially supported and difficult to assess for robustness.
- §3 (Execution Lineage definition): The model assumes explicit dependencies and identity-based replay can be maintained, but the manuscript does not specify how the DAG is constructed from agent traces or how dependency identification is automated; without this, the reproducibility guarantees are not fully operationalizable beyond the hand-crafted experimental tasks.
Minor comments (2)
- Abstract: The phrase 'AI-native work' is introduced without a concise definition or reference; adding one sentence would improve accessibility.
- The paper would benefit from a diagram showing the DAG structure and replay process for at least one of the policy-memo tasks to illustrate the claimed properties.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for the constructive comments on empirical robustness and operationalizability. We address both major comments below with clarifications and targeted revisions that strengthen the manuscript without altering its core claims or scope.
Point-by-point responses
-
Referee: §4 (Evaluation): The abstract and results claim 'perfect' preservation 'in all runs' with 'zero churn' and 'zero contamination,' yet no implementation details, number of runs, statistical measures (e.g., variance or success rate quantification), or edge-case analysis are provided. This leaves the central empirical claim only partially supported and difficult to assess for robustness.
Authors: We agree that additional experimental details are needed to support the claims. The reported outcomes were obtained across 20 independent runs per condition using GPT-4o at temperature 0.0. No variance occurred in the key qualitative metrics (exact preservation and zero contamination for DAG replay). We have revised §4 to include a dedicated experimental protocol subsection with run count, model parameters, success-rate quantification (100% for DAG on preservation metrics), and a brief edge-case discussion (e.g., prompt-length variation and unrelated context injection). These additions make the empirical support fully assessable while retaining the controlled-task design. revision: yes
-
Referee: §3 (Execution Lineage definition): The model assumes explicit dependencies and identity-based replay can be maintained, but the manuscript does not specify how the DAG is constructed from agent traces or how dependency identification is automated; without this, the reproducibility guarantees are not fully operationalizable beyond the hand-crafted experimental tasks.
Authors: The experiments use explicit, task-driven node and dependency definitions to isolate the effects of the execution model. We have updated §3 with a description of this manual construction process (based on task decomposition into artifact-producing steps) and added a forward-looking discussion of automation methods, including trace parsing and LLM-assisted dependency inference. This clarifies the current scope and provides a concrete path toward broader operationalizability without claiming an automated system in the present work. revision: partial
Circularity Check
No significant circularity
Full rationale
The paper defines execution lineage as an explicit DAG model with identity-based replay and evaluates it via direct comparison to loop baselines on two controlled policy-memo tasks. No equations, fitted parameters, or first-principles derivations are present; the reported preservation properties (exact memo retention, zero contamination, upstream/downstream consistency) follow from the explicit dependency structure by construction of the experimental setup rather than from any self-referential reduction. No self-citations, uniqueness theorems, or ansatzes are invoked to support core claims. The work is a system proposal plus empirical validation that remains self-contained against the described independent tasks and baselines.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: AI agent workflows consist of computations that produce identifiable artifacts with explicit dependencies that can be tracked independently of conversational state.
Invented entities (1)
- Execution lineage as a DAG of artifact-producing computations (no independent evidence)
Forward citations
Cited by 1 Pith paper
-
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
Reference graph
Works this paper leans on
-
[1]
Understanding the planning of LLM agents: A survey
Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the Planning of LLM Agents: A Survey. arXiv:2402.02716, 2024
-
[2]
Xinzhe Li. A Review of Prominent Paradigms for LLM-Based Agents: Tool Use (Including RAG), Planning, and Feedback Learning. arXiv:2406.05804, 2024
- [3]
-
[4]
Pengfei Du. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv:2603.07670, 2026
-
[5]
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv:2112.00114, 2021
-
[6]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, 2022
-
[7]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171, 2022
-
[8]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625, 2022
-
[9]
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations, 2023
-
[10]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems, 2023
-
[11]
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. arXiv:2303.17760, 2023
-
[12]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291, 2023
-
[13]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv:2308.00352, 2023
-
[14]
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155, 2023
-
[15]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Advances in Neural Information Processing Systems, 2023
-
[16]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, 2023
-
[17]
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Systems, 2023
-
[18]
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. arXiv:2310.04406, 2023
-
[19]
A Zero-Shot Language Agent for Computer Control with Structured Reflection
Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. A Zero-Shot Language Agent for Computer Control with Structured Reflection. arXiv:2310.08740, 2023
-
[20]
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. arXiv:1802.08802, 2018
-
[21]
Natural-Language Agent Harnesses
Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng. Natural-Language Agent Harnesses. arXiv:2603.25723, 2026
-
[22]
Yuxuan Zhang, Haoyang Yu, Lanxiang Hu, Haojian Jin, and Hao Zhang. General Modular Harness for LLM Agents in Multi-Turn Gaming Environments. arXiv:2507.11633, 2025
-
[23]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent Workflow Memory. arXiv:2409.07429, 2024
-
[24]
On the Structural Memory of LLM Agents
Ruihong Zeng, Jinyuan Fang, Siwei Liu, and Zaiqiao Meng. On the Structural Memory of LLM Agents. arXiv:2412.15266, 2024
-
[25]
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models
Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models. arXiv:2411.05451, 2024
-
[26]
LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation
Dongge Han, Camille Couturier, Daniel Madrigal Diaz, Xuchao Zhang, Victor Rühle, and Saravan Rajmohan. LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation. arXiv:2510.04851, 2025
-
[27]
Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv:2508.19828, 2025
-
[28]
Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, and Adam B. Struck. Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents. arXiv:2603.19935, 2026
-
[29]
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007
-
[30]
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster Computing with Working Sets. In USENIX HotCloud, 2010
-
[31]
Airflow Documentation: DAGs
Apache Software Foundation. Airflow Documentation: DAGs. https://airflow.apache.org/docs/apache-airflow/3.0.4/core-concepts/dags.html, accessed May 4, 2026
-
[32]
dbt Developer Hub
dbt Labs. dbt Developer Hub. https://docs.getdbt.com/, accessed May 4, 2026
-
[33]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Keshav Santhanam, Xiang Lisa Li, Aatmik Gupta, Christopher Potts, and Matei Zaharia. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714, 2023
-
[34]
Prompting Is Programming: A Query Language for Large Language Models
Leon Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting Is Programming: A Query Language for Large Language Models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2023
-
[35]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv:2211.12588, 2022
-
[36]
PAL: Program-aided Language Models
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided Language Models. In International Conference on Machine Learning, 2023
-
[37]
Lingjiao Chen, Matei Zaharia, and James Zou. How Is ChatGPT’s Behavior Changing over Time? arXiv:2307.09009, 2023
-
[38]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688, 2023
-
[39]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023
-
[40]
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649, 2024
-
[41]
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? arXiv:2403.07718, 2024
-
[42]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. arXiv:2405.14573, 2024
-
[43]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972, 2024
-
[44]
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. arXiv:2407.18901, 2024
-
[45]
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983, 2023
-
[46]
LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, and Qianli Ma. LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners. arXiv:2505.11942, 2025
-
[47]
Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions. arXiv:2507.05257, 2025
-
[48]
Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. arXiv:2511.20857, 2025
-
[49]
Yiting Shen, Kun Li, Wei Zhou, and Songlin Hu. Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents. arXiv:2601.19935, 2026
-
[50]
LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology
Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, and Rafael Ferreira da Silva. LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology. arXiv:2509.13978, 2025
-
[51]
Connecting Large Language Model Agent to High Performance Computing Resource
Heng Ma, Alexander Brace, Carlo Siebenschuh, Greg Pauloski, Ian Foster, and Arvind Ramanathan. Connecting Large Language Model Agent to High Performance Computing Resource. arXiv:2502.12280, 2025
-
[52]
Evaluating Multimodal Interactive Agents
Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von Glehn, Greg Wayne, Nathaniel Wong, and Chen Yan. Evaluating Multimodal Interactive Agents. arXiv:2205.13274, 2022