Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Pith reviewed 2026-06-29 21:10 UTC · model grok-4.3
The pith
AI agents with frozen model weights still degrade over sessions through four memory aging mechanisms, requiring lifespan benchmarks and stage-targeted repairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reliability in deployed agents is a lifespan property of the full harness rather than a snapshot property of the base model. Interaction history triggers four kinds of aging (compression, interference, revision, and maintenance), which temporal dependency graphs and paired counterfactual probes diagnose by producing distinct profiles for the write, retrieval, and utilization stages of the memory pipeline.
What carries the argument
AgingBench, which organizes degradation into four mechanisms and uses temporal dependency graphs with counterfactual probes to generate diagnostic profiles for memory pipeline stages.
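To make the idea of a stage-level diagnostic profile concrete, here is a minimal sketch. It is not AgingBench's implementation: the `ProbeResult` fields, the attribution rule, and the function names are all assumptions made for illustration, and the summary above does not say how the real probes derive these signals. The sketch assumes each probe has already been reduced to three yes/no observations and attributes a wrong answer to the earliest stage that failed.

```python
from collections import Counter
from dataclasses import dataclass

STAGES = ("ok", "write", "retrieval", "utilization")


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one hypothetical probe (every field is an illustrative assumption)."""
    fact_in_store: bool    # was the needed fact written to memory intact?
    fact_retrieved: bool   # did retrieval surface it for this query?
    answer_correct: bool   # was the agent's final answer correct?


def attribute_stage(probe: ProbeResult) -> str:
    """Attribute a probe to the earliest pipeline stage that failed."""
    if probe.answer_correct:
        return "ok"
    if not probe.fact_in_store:
        return "write"
    if not probe.fact_retrieved:
        return "retrieval"
    # The fact was stored and retrieved, yet the answer was still wrong.
    return "utilization"


def diagnostic_profile(probes: list[ProbeResult]) -> dict[str, float]:
    """Return the fraction of probes attributed to each stage."""
    if not probes:
        raise ValueError("at least one probe is required")
    counts = Counter(attribute_stage(p) for p in probes)
    return {stage: counts.get(stage, 0) / len(probes) for stage in STAGES}
```

Under this toy rule, two agents that give the same wrong answer can land in different buckets, which is the sense in which the same symptom may call for different repairs.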
If this is right
- Behavioral tests can remain clean while factual precision decays, so single-metric checks miss lifespan problems.
- Derived-state tracking can collapse sharply within one model, showing that some failures are abrupt and component-specific.
- The same wrong answer can require different repairs depending on whether the diagnostic profile implicates write, retrieval, or utilization.
- Reliable deployment needs mechanism-level diagnosis and stage-targeted repair instead of relying only on stronger initial models.
- Evaluation must span many sessions to capture the cumulative effects of the four aging mechanisms.
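The first and last points above can be illustrated with a small sketch that fits a least-squares slope to each metric across sessions and flags a dissociation when one metric stays flat while another decays. The function names, the tolerance, and the idea of summarizing an aging curve by a single slope are assumptions for illustration, not the paper's metric definitions.

```python
def aging_slope(scores: list[float]) -> float:
    """Least-squares slope of a metric over sessions 1..n (change per session)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two sessions to fit a slope")
    sessions = range(1, n + 1)
    mean_x = sum(sessions) / n
    mean_y = sum(scores) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sessions, scores))
    sxx = sum((x - mean_x) ** 2 for x in sessions)
    return sxy / sxx


def is_dissociated(behavior: list[float], precision: list[float],
                   tol: float = 0.01) -> bool:
    """True when behavior holds steady (slope above -tol) while precision decays."""
    return aging_slope(behavior) > -tol and aging_slope(precision) < -tol


# Made-up scores for illustration only; these are not results from the paper.
toy_behavior = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
toy_precision = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70]
print(is_dissociated(toy_behavior, toy_precision))
```

A single slope cannot capture the abrupt collapses mentioned in the second point; detecting those would need something like a change-point check, which this sketch leaves out.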
Where Pith is reading between the lines
- Agents could benefit from scheduled memory rejuvenation steps that target the specific mechanism identified by the probes.
- The same diagnostic approach might apply to other stateful systems such as long-running chatbots or recommendation engines that accumulate interaction history.
- Certain memory policies might mitigate particular aging types more effectively than others, suggesting policy optimization as a follow-on direction.
- Combining lifespan monitoring with occasional model updates could extend operational life beyond what either technique achieves alone.
Load-bearing premise
The four aging mechanisms together with the temporal graphs and counterfactual probes provide a complete diagnosis of all degradation sources without unmodeled effects from the agent harness or environment.
What would settle it
The diagnostic framework would be falsified by an experiment in which agents exhibit clear degradation that none of the four mechanisms can explain, or in which the probes consistently point to the wrong pipeline stage for repair.
Original abstract
Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8–200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that deployed AI agents exhibit multi-dimensional aging even with frozen weights, due to four mechanisms (compression aging, interference aging, revision aging, maintenance aging) in the memory pipeline. It introduces AgingBench, which uses temporal dependency graphs and paired counterfactual probes to generate diagnostic profiles for write/retrieval/utilization stages, and reports empirical results from ~400 runs across 7 scenarios, 14 models, multiple policies, and runner-controlled/autonomous agents showing dissociations such as clean behavioral tests with decaying factual precision, sharp derived-state collapses, and profile-dependent repair needs. The work argues for lifespan evaluation and stage-targeted repair over day-one benchmarks alone.
Significance. If the counterfactual probes are validated to isolate the four mechanisms without residual confounds from the harness or environment, the work would meaningfully advance AI agent reliability research by providing a diagnostic framework that distinguishes degradation sources and guides targeted interventions, moving the field beyond snapshot evaluations.
major comments (2)
- [Methods] Methods (description of temporal dependency graphs and paired counterfactual probes): The central claim that these tools produce diagnostic profiles correctly attributing degradation to compression/interference/revision/maintenance without confounds rests on an unvalidated assumption; no controlled injection experiments (holding other mechanisms fixed) or oracle-harness comparisons are described to rule out probe-induced artifacts or runner/environment interactions, which is load-bearing for interpreting the reported dissociations as evidence of distinct lifespan mechanisms.
- [Results] Results (empirical evaluation across ~400 runs): The abstract states results showing multi-dimensional aging (e.g., clean behavior with decaying precision, sharp collapses) but provides no error bars, statistical tests, or details on how the four mechanisms were validated against alternatives, leaving the robustness of the cross-scenario claims difficult to assess.
minor comments (1)
- [Abstract] Abstract: The four aging mechanisms are named but not briefly defined, which would aid readers in following the diagnostic profiles.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important aspects of validation and statistical presentation that we address point by point below. We agree that both issues warrant revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Methods] Methods (description of temporal dependency graphs and paired counterfactual probes): The central claim that these tools produce diagnostic profiles correctly attributing degradation to compression/interference/revision/maintenance without confounds rests on an unvalidated assumption; no controlled injection experiments (holding other mechanisms fixed) or oracle-harness comparisons are described to rule out probe-induced artifacts or runner/environment interactions, which is load-bearing for interpreting the reported dissociations as evidence of distinct lifespan mechanisms.
  Authors: We agree that the manuscript does not describe controlled injection experiments or oracle-harness comparisons to further validate isolation of the four mechanisms. The paired counterfactual probes are constructed to differ only on the targeted factor while holding the harness and environment fixed, and the observed dissociations across scenarios provide supporting evidence. However, to address the concern directly, the revised manuscript will add a dedicated validation subsection that includes synthetic injection experiments (injecting one mechanism while holding others constant) and comparisons against oracle harnesses.
  Revision: yes
- Referee: [Results] Results (empirical evaluation across ~400 runs): The abstract states results showing multi-dimensional aging (e.g., clean behavior with decaying precision, sharp collapses) but provides no error bars, statistical tests, or details on how the four mechanisms were validated against alternatives, leaving the robustness of the cross-scenario claims difficult to assess.
  Authors: We acknowledge that the current results section lacks error bars, statistical tests, and explicit details on distinguishing the four mechanisms from alternatives. The reported dissociations are drawn from the ~400 runs, but the presentation emphasizes qualitative patterns. In the revision we will add error bars to all metrics, include appropriate statistical tests (e.g., paired comparisons and ANOVA across conditions), and expand the results to describe how probe outcomes were used to attribute degradation to specific mechanisms versus alternatives.
  Revision: yes
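As an illustration of what a paired comparison could look like here, the sketch below runs a sign-flip permutation test on per-run scores measured early and late in an agent's lifespan. This is one possible choice of test, not the one the authors commit to, and the function name, parameters, and example scores are hypothetical.

```python
import random


def paired_permutation_pvalue(early: list[float], late: list[float],
                              n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided sign-flip permutation test on the mean paired difference."""
    if len(early) != len(late) or not early:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [b - a for a, b in zip(early, late)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # The +1 correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up factual-precision scores for ten runs; not data from the paper.
early = [0.92, 0.90, 0.95, 0.91, 0.93, 0.89, 0.94, 0.90, 0.92, 0.91]
late = [0.80, 0.78, 0.85, 0.79, 0.83, 0.75, 0.84, 0.77, 0.81, 0.80]
print(paired_permutation_pvalue(early, late))
```

Pairing by run matters because the same scenario and model are measured twice, so between-run variation cancels out of each difference.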
Circularity Check
Empirical benchmark with no derivation chain or self-citation reductions
full rationale
The paper presents AgingBench as a longitudinal empirical benchmark, reporting observations from ~400 runs across scenarios, models, and policies. No equations, fitted parameters, or predictions are described. The four aging mechanisms and counterfactual probes are introduced as definitional components of the benchmark methodology rather than derived results. No self-citations are invoked to justify uniqueness theorems or ansatzes. The central claims rest on direct experimental data rather than any reduction to prior inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Agent state changes after deployment can be partitioned into compression aging, interference aging, revision aging, and maintenance aging.
- domain assumption: Temporal dependency graphs and paired counterfactual probes can isolate failures at the write, retrieval, and utilization stages of the memory pipeline.
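The second assumption leans on temporal dependency graphs, which this summary names but does not define. One plausible reading, sketched below, is a graph in which each fact records the session where it was introduced and the earlier facts it was derived from, so that a revision can be traced to every derived fact that may now be stale. The class, its methods, and the example facts are hypothetical and may differ from the paper's construction.

```python
from collections import defaultdict, deque


class TemporalDependencyGraph:
    """Facts tagged with a session index, linked to the earlier facts they derive from."""

    def __init__(self) -> None:
        self._session: dict[str, int] = {}
        self._dependents: dict[str, set[str]] = defaultdict(set)

    def add_fact(self, name: str, session: int, depends_on: tuple[str, ...] = ()) -> None:
        for parent in depends_on:
            if parent not in self._session:
                raise KeyError(f"unknown dependency: {parent}")
            if self._session[parent] > session:
                raise ValueError("a fact cannot depend on a later session")
        self._session[name] = session
        for parent in depends_on:
            self._dependents[parent].add(name)

    def affected_by(self, revised: str) -> list[str]:
        """Facts downstream of a revised fact, ordered by session and then name."""
        if revised not in self._session:
            raise KeyError(f"unknown fact: {revised}")
        seen: set[str] = set()
        queue = deque([revised])
        while queue:  # breadth-first walk over dependents
            node = queue.popleft()
            for child in self._dependents.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return sorted(seen, key=lambda n: (self._session[n], n))


graph = TemporalDependencyGraph()
graph.add_fact("budget", session=1)
graph.add_fact("headcount", session=2)
graph.add_fact("cost_per_head", session=3, depends_on=("budget", "headcount"))
graph.add_fact("forecast", session=5, depends_on=("cost_per_head",))
print(graph.affected_by("budget"))  # ['cost_per_head', 'forecast']
```

In this reading, a probe after a revision to `budget` would check whether the agent also updated `cost_per_head` and `forecast`, which is the kind of derived-state tracking the abstract says can collapse.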
invented entities (4)
- compression aging: no independent evidence
- interference aging: no independent evidence
- revision aging: no independent evidence
- maintenance aging: no independent evidence