Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Aditya Akella; Haris Vikalo; Jianing Zhu; John Robertson; Junbo Li; Kevin Wang; Yeonju Ro; Zhangyang Wang

arXiv: 2605.26302 · v1 · pith:WM3QS3MDnew · submitted 2026-05-25 · cs.AI · cs.CL · cs.MA

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

Pith reviewed 2026-06-29 21:10 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.MA

keywords AI agents · agent lifespan · memory aging · reliability benchmark · long-term deployment · temporal dependency graphs · counterfactual probes · memory pipeline

The pith

AI agents with frozen model weights still degrade over sessions through four memory aging mechanisms, requiring lifespan benchmarks and stage-targeted repairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that long-lived AI agents lose reliability as persistent systems because their memory compresses history, creates interference between facts, revises information, and performs maintenance, even when base model weights remain unchanged. Day-one benchmarks miss these changes because they evaluate only initial performance rather than behavior across multiple sessions. AgingBench tracks the process by building temporal dependency graphs and running paired counterfactual probes that isolate problems in the write, retrieval, or utilization stages of the memory pipeline. Tests across seven scenarios, fourteen models, and roughly four hundred runs reveal that degradation is multi-dimensional, with some failures appearing only in factual precision or derived-state tracking. This means the same incorrect output can stem from different pipeline stages and therefore needs different fixes.

Core claim

Reliability in deployed agents is a lifespan property of the full harness rather than a snapshot property of the base model, because interaction history triggers compression aging, interference aging, revision aging, and maintenance aging that are diagnosed by temporal dependency graphs and paired counterfactual probes producing distinct profiles for the write, retrieval, and utilization stages of the memory pipeline.

What carries the argument

AgingBench organizes degradation into four mechanisms and uses temporal dependency graphs with paired counterfactual probes to generate diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline.
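As a rough illustration of how probe outcomes could be turned into a stage-level diagnostic profile, here is a minimal Python sketch. The paper's actual probe design is not given on this page, so `ProbeResult`, its three boolean checks, and the write-then-retrieval-then-utilization decision order are all hypothetical assumptions.

```python
from collections import Counter
from dataclasses import dataclass

STAGES = ("write", "retrieval", "utilization")


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical outcome of the probes run for one queried fact."""
    answer_correct: bool   # agent's answer under normal operation
    fact_in_store: bool    # the fact is present and intact in memory
    fact_retrieved: bool   # retrieval surfaced the fact for this query


def attribute_stage(p: ProbeResult) -> str:
    """Assumed rule: blame the earliest pipeline stage that failed."""
    if p.answer_correct:
        return "ok"
    if not p.fact_in_store:
        return "write"        # never stored, or stored in damaged form
    if not p.fact_retrieved:
        return "retrieval"    # stored, but not surfaced for the query
    return "utilization"      # surfaced, but the answer is still wrong


def diagnostic_profile(results):
    """Fraction of failures attributed to each stage (all zeros if no failures)."""
    counts = Counter(attribute_stage(r) for r in results if not r.answer_correct)
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in STAGES}


# Three runs that all give the same wrong answer, for three different reasons.
runs = [
    ProbeResult(False, False, False),
    ProbeResult(False, True, False),
    ProbeResult(False, True, True),
]
print(diagnostic_profile(runs))
```

Under these assumptions, identical wrong answers map to different stages, which is the situation the page describes as needing different repairs.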

If this is right

Behavioral tests can remain clean while factual precision decays, so single-metric checks miss lifespan problems.
Derived-state tracking can collapse sharply within one model, showing that some failures are abrupt and component-specific.
The same wrong answer can require different repairs depending on whether the diagnostic profile implicates write, retrieval, or utilization.
Reliable deployment needs mechanism-level diagnosis and stage-targeted repair instead of relying only on stronger initial models.
Evaluation must span many sessions to capture the cumulative effects of the four aging mechanisms.
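The first and last points above can be sketched as a per-metric check over many sessions. This is a minimal Python sketch, not the benchmark's scoring code: the metric names, the window size, the drop threshold, and the scores themselves are made-up illustrative values.

```python
from statistics import mean


def decayed_metrics(history, window=3, max_drop=0.10):
    """Flag metrics whose late-session mean fell more than max_drop below
    their early-session mean.

    history maps a metric name to its per-session scores in [0, 1].
    window and max_drop are illustrative defaults, not values from the paper.
    """
    flagged = {}
    for name, scores in history.items():
        if len(scores) < 2 * window:
            continue  # too few sessions to compare early against late
        drop = mean(scores[:window]) - mean(scores[-window:])
        if drop > max_drop:
            flagged[name] = drop
    return flagged


# Made-up scores: behavior stays clean while factual precision decays.
history = {
    "behavioral_pass_rate": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "factual_precision": [0.95, 0.94, 0.93, 0.80, 0.72, 0.65],
}
print(sorted(decayed_metrics(history)))
```

A check that watched only `behavioral_pass_rate` would report nothing here, which is why the list above argues against single-metric, single-session evaluation.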

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Agents could benefit from scheduled memory rejuvenation steps that target the specific mechanism identified by the probes.
The same diagnostic approach might apply to other stateful systems such as long-running chatbots or recommendation engines that accumulate interaction history.
Certain memory policies might mitigate particular aging types more effectively than others, suggesting policy optimization as a follow-on direction.
Combining lifespan monitoring with occasional model updates could extend operational life beyond what either technique achieves alone.

Load-bearing premise

The four aging mechanisms together with the temporal graphs and counterfactual probes provide a complete diagnosis of all degradation sources without unmodeled effects from the agent harness or environment.

What would settle it

The diagnostic framework would be falsified by an experiment in which agents exhibit clear degradation that none of the four mechanisms can explain, or in which the probes consistently point to the wrong pipeline stage for repair.

Original abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8–200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
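One way to picture a temporal dependency graph is as a set of facts stamped with the session in which they were last written, plus edges from derived facts to their inputs. The Python sketch below is an assumption about what such a structure might look like, not the paper's data model: `Node`, `stale_nodes`, and the example facts are hypothetical. It finds derived state that was computed before one of its inputs was revised.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A remembered fact; derived facts list the facts they were computed from."""
    name: str
    updated_at: int                                  # session index of last write
    depends_on: list = field(default_factory=list)   # names of input facts


def stale_nodes(graph):
    """Names of facts older than something they transitively depend on.

    graph maps a name to its Node and is assumed to be acyclic.
    """
    memo = {}

    def freshest_input(name):
        # Latest session in which any transitive input of `name` was written.
        if name not in memo:
            latest = -1
            for dep in graph[name].depends_on:
                latest = max(latest, graph[dep].updated_at, freshest_input(dep))
            memo[name] = latest
        return memo[name]

    return sorted(n for n, node in graph.items()
                  if freshest_input(n) > node.updated_at)


# Hypothetical history: "budget" is revised in session 5, after both
# derived facts were last computed.
graph = {
    "budget": Node("budget", updated_at=5),
    "headcount": Node("headcount", updated_at=1),
    "cost_per_head": Node("cost_per_head", 3, ["budget", "headcount"]),
    "report": Node("report", 4, ["cost_per_head"]),
}
print(stale_nodes(graph))
```

In this toy history both derived facts are out of date, which is the kind of derived-state failure that a probe limited to base facts would not expose.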

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgingBench frames agent degradation as a multi-dimensional lifespan problem with four mechanisms and stage-level diagnostics, but whether the probes can isolate causes without confounds remains the open question.

The main thing here is that agent reliability over time is not just about the base model but about how the full memory harness changes across sessions, and the paper gives a concrete way to break that down into four aging types with targeted diagnostics.

What is new is the split into compression aging, interference aging, revision aging, and maintenance aging, plus the use of temporal dependency graphs and counterfactual probes to produce profiles for the write, retrieval, and utilization stages. The experiments run across 7 scenarios, 14 models, multiple policies, and both runner-controlled and autonomous agents for roughly 400 runs, and they surface clear dissociations: behavioral tests can stay clean while factual precision falls, derived-state tracking can drop sharply, and the right fix depends on the profile. That empirical pattern is useful and not something prior snapshot benchmarks would catch.

The work covers a reasonable range of setups and makes a practical case for stage-targeted repair rather than just bigger day-one models.

The softer part is the validation of the diagnostics themselves. The abstract does not include error bars, statistical tests, or explicit checks that the counterfactual probes isolate one mechanism from the others or from harness and environment effects. If the probes shift state or interact with the runner in ways not modeled, the reported multi-dimensionality could be partly measurement-driven. The full methods will need to show controlled injections or oracle comparisons to close that gap.
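The controlled-injection check asked for here could be scored along the following lines. This is a minimal Python sketch under stated assumptions: each trial injects exactly one mechanism while the others are held fixed, and records which mechanism the diagnostics blamed. The trial data and the function names are hypothetical.

```python
from collections import Counter, defaultdict

MECHANISMS = ("compression", "interference", "revision", "maintenance")


def confusion(trials):
    """Count diagnosed mechanisms per injected mechanism.

    trials is an iterable of (injected, diagnosed) name pairs.
    """
    table = defaultdict(Counter)
    for injected, diagnosed in trials:
        table[injected][diagnosed] += 1
    return table


def isolation_accuracy(trials):
    """Share of trials in which the diagnosis matched the injected mechanism."""
    table = confusion(trials)
    return {m: table[m][m] / sum(table[m].values())
            for m in MECHANISMS if table[m]}


# Made-up trials: compression is sometimes misread as revision.
trials = (
    [("compression", "compression")] * 9
    + [("compression", "revision")]
    + [("interference", "interference")] * 10
)
print(isolation_accuracy(trials))
```

Large off-diagonal counts in the confusion table would suggest that the reported profiles partly reflect probe artifacts instead of distinct mechanisms.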

This is for groups building or maintaining long-lived agent systems who want better than one-shot evals. It deserves peer review because the framework and scale are substantial enough to be worth referee time, even if the diagnostic claims will likely need tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that deployed AI agents exhibit multi-dimensional aging even with frozen weights, due to four mechanisms (compression aging, interference aging, revision aging, maintenance aging) in the memory pipeline. It introduces AgingBench, which uses temporal dependency graphs and paired counterfactual probes to generate diagnostic profiles for write/retrieval/utilization stages, and reports empirical results from ~400 runs across 7 scenarios, 14 models, multiple policies, and runner-controlled/autonomous agents showing dissociations such as clean behavioral tests with decaying factual precision, sharp derived-state collapses, and profile-dependent repair needs. The work argues for lifespan evaluation and stage-targeted repair over day-one benchmarks alone.

Significance. If the counterfactual probes are validated to isolate the four mechanisms without residual confounds from the harness or environment, the work would meaningfully advance AI agent reliability research by providing a diagnostic framework that distinguishes degradation sources and guides targeted interventions, moving the field beyond snapshot evaluations.

major comments (2)

[Methods] Methods (description of temporal dependency graphs and paired counterfactual probes): The central claim that these tools produce diagnostic profiles correctly attributing degradation to compression/interference/revision/maintenance without confounds rests on an unvalidated assumption; no controlled injection experiments (holding other mechanisms fixed) or oracle-harness comparisons are described to rule out probe-induced artifacts or runner/environment interactions, which is load-bearing for interpreting the reported dissociations as evidence of distinct lifespan mechanisms.
[Results] Results (empirical evaluation across ~400 runs): The abstract states results showing multi-dimensional aging (e.g., clean behavior with decaying precision, sharp collapses) but provides no error bars, statistical tests, or details on how the four mechanisms were validated against alternatives, leaving the robustness of the cross-scenario claims difficult to assess.

minor comments (1)

[Abstract] Abstract: The four aging mechanisms are named but not briefly defined, which would aid readers in following the diagnostic profiles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important aspects of validation and statistical presentation that we address point by point below. We agree that both issues warrant revisions to strengthen the manuscript.

Referee: [Methods] Methods (description of temporal dependency graphs and paired counterfactual probes): The central claim that these tools produce diagnostic profiles correctly attributing degradation to compression/interference/revision/maintenance without confounds rests on an unvalidated assumption; no controlled injection experiments (holding other mechanisms fixed) or oracle-harness comparisons are described to rule out probe-induced artifacts or runner/environment interactions, which is load-bearing for interpreting the reported dissociations as evidence of distinct lifespan mechanisms.

Authors: We agree that the manuscript does not describe controlled injection experiments or oracle-harness comparisons to further validate isolation of the four mechanisms. The paired counterfactual probes are constructed to differ only on the targeted factor while holding the harness and environment fixed, and the observed dissociations across scenarios provide supporting evidence. However, to address the concern directly, the revised manuscript will add a dedicated validation subsection that includes synthetic injection experiments (injecting one mechanism while holding others constant) and comparisons against oracle harnesses. revision: yes
Referee: [Results] Results (empirical evaluation across ~400 runs): The abstract states results showing multi-dimensional aging (e.g., clean behavior with decaying precision, sharp collapses) but provides no error bars, statistical tests, or details on how the four mechanisms were validated against alternatives, leaving the robustness of the cross-scenario claims difficult to assess.

Authors: We acknowledge that the current results section lacks error bars, statistical tests, and explicit details on distinguishing the four mechanisms from alternatives. The reported dissociations are drawn from the ~400 runs, but the presentation emphasizes qualitative patterns. In the revision we will add error bars to all metrics, include appropriate statistical tests (e.g., paired comparisons and ANOVA across conditions), and expand the results to describe how probe outcomes were used to attribute degradation to specific mechanisms versus alternatives. revision: yes
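For the error bars promised in this response, one simple option is a percentile bootstrap over paired runs. The authors name paired comparisons and ANOVA; the bootstrap below is a swapped-in technique chosen because it needs only the standard library. The function name, the scores, and the resample count are illustrative assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_ci(early, late, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean paired drop (early - late).

    early[i] and late[i] are scores from the same run at an early and a
    late session. Returns (point_estimate, lower, upper).
    """
    if len(early) != len(late) or not early:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [e - l for e, l in zip(early, late)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    resampled = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Made-up factual-precision scores for six runs.
early = [0.95, 0.93, 0.96, 0.94, 0.92, 0.95]
late = [0.71, 0.80, 0.66, 0.75, 0.78, 0.69]
point, lower, upper = paired_bootstrap_ci(early, late)
print(f"mean drop {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support a claim of decay for that metric. With only a handful of runs per condition, percentile intervals are rough and should be read with caution.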

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-citation reductions

The paper presents AgingBench as a longitudinal empirical benchmark, reporting observations from ~400 runs across scenarios, models, and policies. No equations, fitted parameters, or predictions are described. The four aging mechanisms and counterfactual probes are introduced as definitional components of the benchmark methodology rather than derived results. No self-citations are invoked to justify uniqueness theorems or ansatzes. The central claims rest on direct experimental data rather than any reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 4 invented entities

The central claim rests on the domain assumption that agent state changes can be partitioned into four aging mechanisms and three memory stages; these are introduced without independent evidence outside the benchmark itself.

axioms (2)

domain assumption: Agent state changes after deployment can be partitioned into compression aging, interference aging, revision aging, and maintenance aging.
This partitioning is used to organize the benchmark and interpret the diagnostic profiles.
domain assumption: Temporal dependency graphs and paired counterfactual probes can isolate failures at the write, retrieval, and utilization stages of the memory pipeline.
This is invoked to produce the diagnostic profiles that distinguish the aging mechanisms.

invented entities (4)

compression aging (no independent evidence)
purpose: Categorize degradation caused by history compression in memory.
Newly defined mechanism with no independent evidence cited beyond the benchmark results.
interference aging (no independent evidence)
purpose: Categorize degradation caused by memory interference.
Newly defined mechanism with no independent evidence cited beyond the benchmark results.
revision aging (no independent evidence)
purpose: Categorize degradation caused by fact revisions.
Newly defined mechanism with no independent evidence cited beyond the benchmark results.
maintenance aging (no independent evidence)
purpose: Categorize degradation caused by routine maintenance operations.
Newly defined mechanism with no independent evidence cited beyond the benchmark results.

pith-pipeline@v0.9.1-grok · 5826 in / 1591 out tokens · 36234 ms · 2026-06-29T21:10:50.090824+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 35 canonical work pages · 20 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

SandhiniAgarwal, LamaAhmad, JasonAi, SamAltman, AndyApplebaum, Edwin Arbus, RahulK Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Claude code.https://claude.ai, 2026

Anthropic. Claude code.https://claude.ai, 2026. Accessed: 2026-04

2026
[3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, 2025

2025
[5]

Indexing and access for digital libraries and the internet: Human, database, and domain factors.Journal of the American Society for information science, 49(13):1185–1205, 1998

Marcia J Bates. Indexing and access for digital libraries and the internet: Human, database, and domain factors.Journal of the American Society for information science, 49(13):1185–1205, 1998

1998
[6]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, KurtKeutzer, AdityaParameswaran, DanKlein, KannanRamchandran, etal. Whydomulti-agent llm systems fail?arXiv preprint arXiv:2503.13657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Vehiclemembench: An executable benchmark for multi-user long-term memory in in-vehicle agents.arXiv preprint arXiv:2603.23840, 2026

Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin, Qingyu Zhang, Ya Li, Quan Liu, and Tong Xu. Vehiclemembench: An executable benchmark for multi-user long-term memory in in-vehicle agents.arXiv preprint arXiv:2603.23840, 2026

work page arXiv 2026
[9]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

EvoClaw: Evaluating AI Agents on Continuous Software Evolution

Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, et al. Evoclaw: Evaluating ai agents on continuous software evolution.arXiv preprint arXiv:2603.13428, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Human hippocampal neurogenesis in adulthood, ageing and alzheimer’s disease.Nature, 2026

Ahmed Disouky, Mark A Sanborn, KR Sabitha, Mostafa M Mostafa, Ivan Alejandro Ayala, David A Bennett, Yisha Lu, Yi Zhou, C Dirk Keene, Sandra Weintraub, et al. Human hippocampal neurogenesis in adulthood, ageing and alzheimer’s disease.Nature, 2026

2026
[12]

Memory traces in dynamical systems

Surya Ganguli, Dongsung Huh, and Haim Sompolinsky. Memory traces in dynamical systems. Proceedings of the national academy of sciences, 105(48):18970–18975, 2008

2008
[13]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

A survey of vibe coding with large language models.arXiv preprint arXiv:2510.12399, 2025

Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, et al. A survey of vibe coding with large language models.arXiv preprint arXiv:2510.12399, 2025. 13 /hourglass-halfYour Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

work page arXiv 2025
[15]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Regression test selection for java software.ACM Sigplan Notices, 36(11):312–326, 2001

Mary Jean Harrold, James A Jones, Tongyu Li, Donglin Liang, Alessandro Orso, Maikel Pennings, Saurabh Sinha, S Alexander Spoon, and Ashish Gujarathi. Regression test selection for java software.ACM Sigplan Notices, 36(11):312–326, 2001

2001
[17]

arXiv preprint arXiv:2602.16313 , year=

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interde- pendent multi-session agentic tasks.arXiv preprint arXiv:2602.16313, 2026

work page arXiv 2026
[18]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Amem- gym: Interactive memory benchmarking for assistants in long-horizon conversations.arXiv preprint arXiv:2603.01966, 2026

Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song, and Xunliang Cai. Amemgym: Interactive memory benchmarking for assistants in long-horizon conversations. arXiv preprint arXiv:2603.01966, 2026

work page arXiv 2026
[21]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Memarchitect: A policy driven memory governance layer.arXiv preprint arXiv:2603.18330, 2026

Lingavasan Suresh Kumar, Yang Ba, and Rong Pan. Memarchitect: A policy driven memory governance layer.arXiv preprint arXiv:2603.18330, 2026

work page arXiv 2026
[23]

LLMs get lost in multi-turn conversation

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi-turn conversation. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[24]

Sage: Self-evolving agents with reflective and memory-augmented abilities.Neurocomputing, 647:130470, 2025

Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Yangfan He, Jingsong Yang, Tianyu Shi, Yuantao Wang, et al. Sage: Self-evolving agents with reflective and memory-augmented abilities.Neurocomputing, 647:130470, 2025

2025
[25]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

2024
[26]

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, et al. Perma: Benchmarking personalized memory agents via event-driven preference and realistic task environments.arXiv preprint arXiv:2603.23231, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges.arXiv preprint arXiv:2503.21460, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568,

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568, 2023. 14 /hourglass-halfYour Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

work page arXiv 2023
[29]

Evaluating very long-term conversational memory of llm agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

2024
[30]

Beliefshift: Benchmarking temporal belief consistency and opinion drift in llm agents.arXiv preprint arXiv:2603.23848, 2026

Praveen Kumar Myakala, Manan Agrawal, and Rahul Manche. Beliefshift: Benchmarking temporal belief consistency and opinion drift in llm agents.arXiv preprint arXiv:2603.23848, 2026

work page arXiv 2026
[31]

Technical debt and the reliability of enterprise software systems: A competing risks analysis.Management Science, 62(5):1487–1510, 2016

Narayan Ramasubbu and Chris F Kemerer. Technical debt and the reliability of enterprise software systems: A competing risks analysis.Management Science, 62(5):1487–1510, 2016

2016
[32]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Openhands: An open platform for ai soft- ware developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai soft- ware developers as generalist agents. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[35]

The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, and Robert D Nowak. The long-horizon task mirage? diagnosing where and why agentic systems break.arXiv preprint arXiv:2604.11978, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

Extensible database simulator for fast prototyping in-database algorithms

Yifan Wang and Daisy Zhe Wang. Extensible database simulator for fast prototyping in-database algorithms. InProceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 5029–5033, 2022

2022
[37]

On the form of forgetting.Psychological science, 2(6):409– 415, 1991

John T Wixted and Ebbe B Ebbesen. On the form of forgetting.Psychological science, 2(6):409– 415, 1991

1991
[38]

Rethinking computer architectures and software systems for phase-change memory.ACM Journal on Emerging Technologies in Computing Systems (JETC), 12(4):1–40, 2016

Chengwen Wu, Guangyan Zhang, and Keqin Li. Rethinking computer architectures and software systems for phase-change memory.ACM Journal on Emerging Technologies in Computing Systems (JETC), 12(4):1–40, 2016

2016
[39]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarkingchatassistantsonlong-terminteractivememory.arXivpreprintarXiv:2410.10813, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Learning on the job: An experience-driven self-evolving agent for long-horizon tasks.arXiv preprint arXiv:2510.08002, 2025

Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, et al. Learning on the job: An experience-driven self-evolving agent for long-horizon tasks.arXiv preprint arXiv:2510.08002, 2025

work page arXiv 2025
[41]

2603.03296 , archivePrefix=

Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, and ChengXiang Zhai. Plugmem: A task-agnostic plugin memory module for llm agents. arXiv preprint arXiv:2603.03296, 2026

work page arXiv 2026
[42]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045, 2024. 15 /hourglass-halfYour Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

2022
[44]

D-mem: A dual-process memory system for llm agents.arXiv preprint arXiv:2603.18631, 2026

Zhixing You, Jiachen Yuan, and Jason Cai. D-mem: A dual-process memory system for llm agents.arXiv preprint arXiv:2603.18631, 2026

work page arXiv 2026
[45]

Routine: A structural planning framework for llm agent system in enterprise.arXiv preprint arXiv:2507.14447, 2025

Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, Zhantao Wang, Yifan Nie, Shuang Li, Qiuyang Feng, Pengxu Qiu, et al. Routine: A structural planning framework for llm agent system in enterprise.arXiv preprint arXiv:2507.14447, 2025

work page arXiv 2025
[46]

Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. arXiv preprint arXiv:2505.00212, 2025.
[47]

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, et al. AMA-Bench: Evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769, 2026.
[48]

Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, et al. Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224, 2026.
[49]

Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Yuhui Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, and Daben Liu. Raffles: Reasoning-based attribution of faults for LLM systems. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7659–7688, 2026.
[50]

Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where LLM agents fail and how they can learn from failures. arXiv preprint arXiv:2509.25370, 2025.
[51]

Qiming Zhu, Shunian Chen, Rui Yu, Zhehao Wu, and Benyou Wang. From lossy to verified: A provenance-aware tiered memory for agents. arXiv preprint arXiv:2602.17913, 2026.
[52]

Shitong Zhu, Chenhao Fang, Derek Larson, Neel Reddy Pochareddy, Rajeev Rao, Sophie Zeng, Yanqing Peng, Wendy Summer, Alex Goncalves, Arya Pudota, et al. Compliance brain assistant: Conversational agentic AI for assisting compliance tasks in enterprise environments. arXiv preprint arXiv:2507.17289, 2025.
[53]

Shiwei Zhu, Junjie Wu, Hui Xiong, and Guoping Xia. Scaling up top-k cosine similarity search. Data & Knowledge Engineering, 70(1):60–83, 2011.

[1]

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

[2]

Anthropic. Claude Code. https://claude.ai, 2026. Accessed: 2026-04.

[3]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[4]

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, 2025.

[5]

Marcia J Bates. Indexing and access for digital libraries and the internet: Human, database, and domain factors. Journal of the American Society for Information Science, 49(13):1185–1205, 1998.

[6]

Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657, 2025.

[7]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[8]

Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu, Luxi Lin, Qingyu Zhang, Ya Li, Quan Liu, and Tong Xu. Vehiclemembench: An executable benchmark for multi-user long-term memory in in-vehicle agents. arXiv preprint arXiv:2603.23840, 2026.

[9]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

[10]

Gangda Deng, Zhaoling Chen, Zhongming Yu, Haoyang Fan, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Rajgopal Kannan, Le Cong, Mengdi Wang, et al. EvoClaw: Evaluating AI agents on continuous software evolution. arXiv preprint arXiv:2603.13428, 2026.

[11]

Ahmed Disouky, Mark A Sanborn, KR Sabitha, Mostafa M Mostafa, Ivan Alejandro Ayala, David A Bennett, Yisha Lu, Yi Zhou, C Dirk Keene, Sandra Weintraub, et al. Human hippocampal neurogenesis in adulthood, ageing and Alzheimer’s disease. Nature, 2026.

[12]

Surya Ganguli, Dongsung Huh, and Haim Sompolinsky. Memory traces in dynamical systems. Proceedings of the National Academy of Sciences, 105(48):18970–18975, 2008.

[13]

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025.

[14]

Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, et al. A survey of vibe coding with large language models. arXiv preprint arXiv:2510.12399, 2025.

[15]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[16]

Mary Jean Harrold, James A Jones, Tongyu Li, Donglin Liang, Alessandro Orso, Maikel Pennings, Saurabh Sinha, S Alexander Spoon, and Ashish Gujarathi. Regression test selection for Java software. ACM SIGPLAN Notices, 36(11):312–326, 2001.

[17]

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. arXiv preprint arXiv:2602.16313, 2026.

[18]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

[19]

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of AI agents. arXiv preprint arXiv:2512.13564, 2025.

[20]

Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song, and Xunliang Cai. Amemgym: Interactive memory benchmarking for assistants in long-horizon conversations. arXiv preprint arXiv:2603.01966, 2026.

[21]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

[22]

Lingavasan Suresh Kumar, Yang Ba, and Rong Pan. Memarchitect: A policy driven memory governance layer. arXiv preprint arXiv:2603.18330, 2026.

[23]

Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi-turn conversation. In The Fourteenth International Conference on Learning Representations, 2026.

[24]

Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Yangfan He, Jingsong Yang, Tianyu Shi, Yuantao Wang, et al. Sage: Self-evolving agents with reflective and memory-augmented abilities. Neurocomputing, 647:130470, 2025.

[25]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.

[26]

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, et al. PERMA: Benchmarking personalized memory agents via event-driven preference and realistic task environments. arXiv preprint arXiv:2603.23231, 2026.

[27]

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025.

[28]

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.

[29]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.

[30]

Praveen Kumar Myakala, Manan Agrawal, and Rahul Manche. Beliefshift: Benchmarking temporal belief consistency and opinion drift in LLM agents. arXiv preprint arXiv:2603.23848, 2026.

[31]

Narayan Ramasubbu and Chris F Kemerer. Technical debt and the reliability of enterprise software systems: A competing risks analysis. Management Science, 62(5):1487–1510, 2016.

[32]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.

[33]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[34]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, 2025.

[35]

Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, and Robert D Nowak. The long-horizon task mirage? Diagnosing where and why agentic systems break. arXiv preprint arXiv:2604.11978, 2026.

[36]

Yifan Wang and Daisy Zhe Wang. Extensible database simulator for fast prototyping in-database algorithms. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 5029–5033, 2022.

[37]

John T Wixted and Ebbe B Ebbesen. On the form of forgetting. Psychological Science, 2(6):409–415, 1991.

[38]

Chengwen Wu, Guangyan Zhang, and Keqin Li. Rethinking computer architectures and software systems for phase-change memory. ACM Journal on Emerging Technologies in Computing Systems (JETC), 12(4):1–40, 2016.

[39]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[40]

Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, et al. Learning on the job: An experience-driven self-evolving agent for long-horizon tasks. arXiv preprint arXiv:2510.08002, 2025.
