
arXiv: 2605.26302 · v1 · pith:WM3QS3MDnew · submitted 2026-05-25 · 💻 cs.AI · cs.CL · cs.MA

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Pith reviewed 2026-06-29 21:10 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA
keywords AI agents · agent lifespan · memory aging · reliability benchmark · long-term deployment · temporal dependency graphs · counterfactual probes · memory pipeline

The pith

AI agents with frozen model weights still degrade over sessions through four memory aging mechanisms, requiring lifespan benchmarks and stage-targeted repairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that long-lived AI agents lose reliability as persistent systems because their memory compresses history, creates interference between facts, revises information, and performs maintenance, even when base model weights remain unchanged. Day-one benchmarks miss these changes because they evaluate only initial performance rather than behavior across multiple sessions. AgingBench tracks the process by building temporal dependency graphs and running paired counterfactual probes that isolate problems in the write, retrieval, or utilization stages of the memory pipeline. Tests across seven scenarios, fourteen models, and roughly four hundred runs reveal that degradation is multi-dimensional, with some failures appearing only in factual precision or derived-state tracking. This means the same incorrect output can stem from different pipeline stages and therefore needs different fixes.
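To make the single-metric point concrete, here is a minimal sketch that fits a least-squares slope to each metric across sessions. The scores are synthetic values chosen for illustration, not results from the paper, and `aging_slope` and the metric names are hypothetical.

```python
from statistics import mean

def aging_slope(scores):
    """Least-squares slope of score against session index (0, 1, 2, ...)."""
    xs = range(len(scores))
    x_bar, y_bar = mean(xs), mean(scores)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Synthetic per-session scores (illustrative only, not taken from the paper).
curves = {
    "behavioral_pass_rate": [1.0, 1.0, 1.0, 1.0, 1.0],
    "factual_precision": [0.95, 0.90, 0.85, 0.80, 0.75],
}

# One slope per metric: a flat behavioral curve can sit next to a falling one.
slopes = {name: aging_slope(values) for name, values in curves.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:+.3f} per session")
```

A check that only watched `behavioral_pass_rate` would report no change here, while `factual_precision` loses about 0.05 per session.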

Core claim

Reliability in deployed agents is a lifespan property of the full harness rather than a snapshot property of the base model. Interaction history triggers compression aging, interference aging, revision aging, and maintenance aging, and AgingBench diagnoses these with temporal dependency graphs and paired counterfactual probes that produce distinct profiles for the write, retrieval, and utilization stages of the memory pipeline.

What carries the argument

AgingBench, which organizes degradation into four mechanisms and uses temporal dependency graphs with counterfactual probes to generate diagnostic profiles for memory pipeline stages.
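The page does not describe how AgingBench builds its temporal dependency graphs, so the following is only a toy sketch of the general idea: facts are nodes stamped with the session that last wrote them, and edges record which facts were derived from which. The class name, method names, and example facts are all hypothetical.

```python
from collections import defaultdict, deque

class TemporalDependencyGraph:
    """Toy graph: facts are nodes stamped with the session that last wrote them."""

    def __init__(self):
        self.written_at = {}                # fact -> session index of last write
        self.dependents = defaultdict(set)  # source fact -> facts derived from it

    def write(self, fact, session, depends_on=()):
        self.written_at[fact] = session
        for source in depends_on:
            self.dependents[source].add(fact)

    def stale_after_revision(self, fact, session):
        """Revise `fact` at `session`; return derived facts not refreshed since."""
        self.written_at[fact] = session
        stale, queue = set(), deque([fact])
        while queue:
            current = queue.popleft()
            for child in self.dependents[current]:
                # A derived fact is stale if it was written before the revision.
                if child not in stale and self.written_at[child] < session:
                    stale.add(child)
                    queue.append(child)
        return stale

# Hypothetical example: a revised price invalidates everything computed from it.
g = TemporalDependencyGraph()
g.write("unit_price", session=1)
g.write("quantity", session=1)
g.write("order_total", session=2, depends_on=["unit_price", "quantity"])
g.write("invoice_summary", session=3, depends_on=["order_total"])
stale = g.stale_after_revision("unit_price", session=5)
print(sorted(stale))
```

In this toy setup, a probe about `invoice_summary` after session 5 would test derived-state tracking, because the agent has to propagate the revision through two dependency hops.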

If this is right

  • Behavioral tests can remain clean while factual precision decays, so single-metric checks miss lifespan problems.
  • Derived-state tracking can collapse sharply within one model, showing that some failures are abrupt and component-specific.
  • The same wrong answer can require different repairs depending on whether the diagnostic profile implicates write, retrieval, or utilization.
  • Reliable deployment needs mechanism-level diagnosis and stage-targeted repair instead of relying only on stronger initial models.
  • Evaluation must span many sessions to capture the cumulative effects of the four aging mechanisms.
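The third point above, that one wrong answer can need different repairs, can be illustrated with a small decision rule over paired probe outcomes. This is a sketch under assumed probe conditions, not the paper's procedure: `ProbeResult`, its three fields, and `implicated_stage` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProbeResult:
    actual_correct: bool            # the agent's normal run
    own_store_oracle_correct: bool  # counterfactual: perfect retrieval over the agent's own store
    gold_context_correct: bool      # counterfactual: ground-truth fact placed directly in context

def implicated_stage(p: ProbeResult) -> str:
    """Map one set of paired probe outcomes to the stage it points at."""
    if p.actual_correct:
        return "none"
    if not p.gold_context_correct:
        return "utilization"  # wrong even with the right fact in context
    if not p.own_store_oracle_correct:
        return "write"        # the store itself lacks or corrupts the fact
    return "retrieval"        # the fact is stored but was not surfaced

# Three failures with the same wrong final answer, three different profiles.
probes = {
    "case_a": ProbeResult(False, False, True),
    "case_b": ProbeResult(False, True, True),
    "case_c": ProbeResult(False, False, False),
}
profile = {label: implicated_stage(p) for label, p in probes.items()}
print(profile)
```

The ordering of the checks is a design choice: utilization is tested first because a model that fails with the gold fact in context would fail regardless of what the store or retriever did.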

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents could benefit from scheduled memory rejuvenation steps that target the specific mechanism identified by the probes.
  • The same diagnostic approach might apply to other stateful systems such as long-running chatbots or recommendation engines that accumulate interaction history.
  • Certain memory policies might mitigate particular aging types more effectively than others, suggesting policy optimization as a follow-on direction.
  • Combining lifespan monitoring with occasional model updates could extend operational life beyond what either technique achieves alone.

Load-bearing premise

The four aging mechanisms together with the temporal graphs and counterfactual probes provide a complete diagnosis of all degradation sources without unmodeled effects from the agent harness or environment.

What would settle it

The diagnostic framework would be falsified by an experiment in which agents exhibit clear degradation that none of the four mechanisms can explain, or in which the probes consistently point to the wrong pipeline stage for repair.
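One way to run the second half of that test is to inject a fault at a known stage and count how often the diagnosis agrees. The sketch below assumes such trials exist as (injected stage, diagnosed stage) pairs. The trial data is synthetic and `attribution_report` is a hypothetical helper, not part of AgingBench.

```python
from collections import Counter

def attribution_report(trials):
    """trials: iterable of (injected_stage, diagnosed_stage) pairs."""
    confusion = Counter(trials)
    total = sum(confusion.values())
    correct = sum(n for (injected, diagnosed), n in confusion.items()
                  if injected == diagnosed)
    return confusion, correct / total

# Synthetic trials (illustrative only): one retrieval fault is misread as utilization.
trials = [
    ("write", "write"),
    ("write", "write"),
    ("retrieval", "retrieval"),
    ("retrieval", "utilization"),
    ("utilization", "utilization"),
]
confusion, accuracy = attribution_report(trials)
print(f"attribution accuracy: {accuracy:.2f}")
for (injected, diagnosed), n in sorted(confusion.items()):
    print(f"injected={injected:<12} diagnosed={diagnosed:<12} n={n}")
```

Consistently low agreement on injected faults would be the kind of evidence described above. The off-diagonal cells show which stages the probes confuse.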

Original abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 - 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that deployed AI agents exhibit multi-dimensional aging even with frozen weights, due to four mechanisms (compression aging, interference aging, revision aging, maintenance aging) in the memory pipeline. It introduces AgingBench, which uses temporal dependency graphs and paired counterfactual probes to generate diagnostic profiles for write/retrieval/utilization stages, and reports empirical results from ~400 runs across 7 scenarios, 14 models, multiple policies, and runner-controlled/autonomous agents showing dissociations such as clean behavioral tests with decaying factual precision, sharp derived-state collapses, and profile-dependent repair needs. The work argues for lifespan evaluation and stage-targeted repair over day-one benchmarks alone.

Significance. If the counterfactual probes are validated to isolate the four mechanisms without residual confounds from the harness or environment, the work would meaningfully advance AI agent reliability research by providing a diagnostic framework that distinguishes degradation sources and guides targeted interventions, moving the field beyond snapshot evaluations.

major comments (2)
  1. [Methods] Methods (description of temporal dependency graphs and paired counterfactual probes): The central claim that these tools produce diagnostic profiles correctly attributing degradation to compression/interference/revision/maintenance without confounds rests on an unvalidated assumption; no controlled injection experiments (holding other mechanisms fixed) or oracle-harness comparisons are described to rule out probe-induced artifacts or runner/environment interactions, which is load-bearing for interpreting the reported dissociations as evidence of distinct lifespan mechanisms.
  2. [Results] Results (empirical evaluation across ~400 runs): The abstract states results showing multi-dimensional aging (e.g., clean behavior with decaying precision, sharp collapses) but provides no error bars, statistical tests, or details on how the four mechanisms were validated against alternatives, leaving the robustness of the cross-scenario claims difficult to assess.
minor comments (1)
  1. [Abstract] Abstract: The four aging mechanisms are named but not briefly defined, which would aid readers in following the diagnostic profiles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of validation and statistical presentation that we address point by point below. We agree that both issues warrant revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods] Methods (description of temporal dependency graphs and paired counterfactual probes): The central claim that these tools produce diagnostic profiles correctly attributing degradation to compression/interference/revision/maintenance without confounds rests on an unvalidated assumption; no controlled injection experiments (holding other mechanisms fixed) or oracle-harness comparisons are described to rule out probe-induced artifacts or runner/environment interactions, which is load-bearing for interpreting the reported dissociations as evidence of distinct lifespan mechanisms.

    Authors: We agree that the manuscript does not describe controlled injection experiments or oracle-harness comparisons to further validate isolation of the four mechanisms. The paired counterfactual probes are constructed to differ only on the targeted factor while holding the harness and environment fixed, and the observed dissociations across scenarios provide supporting evidence. However, to address the concern directly, the revised manuscript will add a dedicated validation subsection that includes synthetic injection experiments (injecting one mechanism while holding others constant) and comparisons against oracle harnesses. revision: yes

  2. Referee: [Results] Results (empirical evaluation across ~400 runs): The abstract states results showing multi-dimensional aging (e.g., clean behavior with decaying precision, sharp collapses) but provides no error bars, statistical tests, or details on how the four mechanisms were validated against alternatives, leaving the robustness of the cross-scenario claims difficult to assess.

    Authors: We acknowledge that the current results section lacks error bars, statistical tests, and explicit details on distinguishing the four mechanisms from alternatives. The reported dissociations are drawn from the ~400 runs, but the presentation emphasizes qualitative patterns. In the revision we will add error bars to all metrics, include appropriate statistical tests (e.g., paired comparisons and ANOVA across conditions), and expand the results to describe how probe outcomes were used to attribute degradation to specific mechanisms versus alternatives. revision: yes
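As an illustration of the kind of paired comparison the authors promise, the sketch below runs an exact sign-flip permutation test on per-scenario scores from an early and a late session. The test choice is an assumption made here for illustration (the rebuttal names only "paired comparisons and ANOVA"), the scores are synthetic, and `paired_sign_flip_p` is a hypothetical helper.

```python
from itertools import product
from statistics import mean

def paired_sign_flip_p(early, late):
    """Exact two-sided sign-flip permutation p-value for paired scores."""
    diffs = [e - l for e, l in zip(early, late)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / total

# Synthetic per-scenario factual precision (illustrative only, not from the paper).
early = [0.95, 0.92, 0.96, 0.90, 0.94, 0.93]
late = [0.80, 0.85, 0.78, 0.83, 0.81, 0.79]

p_value = paired_sign_flip_p(early, late)
# Every scenario declines, so only 2 of the 64 sign patterns match the observed gap.
print(f"p = {p_value:.5f}")
```

Exact enumeration is practical only for a small number of pairs, since the loop visits 2^n sign patterns. A larger study would sample sign patterns at random instead.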

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-citation reductions

Full rationale

The paper presents AgingBench as a longitudinal empirical benchmark, reporting observations from ~400 runs across scenarios, models, and policies. No equations, fitted parameters, or predictions are described. The four aging mechanisms and counterfactual probes are introduced as definitional components of the benchmark methodology rather than derived results. No self-citations are invoked to justify uniqueness theorems or ansatzes. The central claims rest on direct experimental data rather than any reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

The central claim rests on the domain assumption that agent state changes can be partitioned into four aging mechanisms and three memory stages; these are introduced without independent evidence outside the benchmark itself.

axioms (2)
  • [domain assumption] Agent state changes after deployment can be partitioned into compression aging, interference aging, revision aging, and maintenance aging.
    This partitioning is used to organize the benchmark and interpret the diagnostic profiles.
  • [domain assumption] Temporal dependency graphs and paired counterfactual probes can isolate failures at the write, retrieval, and utilization stages of the memory pipeline.
    This is invoked to produce the diagnostic profiles that distinguish the aging mechanisms.
invented entities (4)
  • compression aging [no independent evidence]
    purpose: Categorize degradation caused by history compression in memory.
    Newly defined mechanism with no independent evidence cited beyond the benchmark results.
  • interference aging [no independent evidence]
    purpose: Categorize degradation caused by memory interference.
    Newly defined mechanism with no independent evidence cited beyond the benchmark results.
  • revision aging [no independent evidence]
    purpose: Categorize degradation caused by fact revisions.
    Newly defined mechanism with no independent evidence cited beyond the benchmark results.
  • maintenance aging [no independent evidence]
    purpose: Categorize degradation caused by routine maintenance operations.
    Newly defined mechanism with no independent evidence cited beyond the benchmark results.

pith-pipeline@v0.9.1-grok · 5826 in / 1591 out tokens · 36234 ms · 2026-06-29T21:10:50.090824+00:00 · methodology

