pith. machine review for the scientific record.

arxiv: 2605.13941 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords self-evolving memory · LLM agents · long-term memory · retrieval configuration · failure diagnosis · AutoResearch · memory architecture · autonomous optimization

The pith

LLM agents can improve long-term memory by letting an LLM module diagnose retrieval failures and autonomously adjust the system's own configuration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that memory for LLM agents needs to evolve at two levels at once: the stored knowledge and the retrieval rules that query it. Existing systems keep scoring functions, fusion methods, and generation policies frozen after deployment, so the authors expose the entire retrieval configuration as an adjustable action space. An LLM diagnosis module reads detailed failure logs after each question, names the root causes, and proposes specific changes; a meta-analyzer applies them only when safeguards confirm no regression. The resulting closed loop runs repeated AutoResearch cycles that start from a minimal baseline and converge on stronger strategies, some of which introduce configuration dimensions absent from the original space.

Core claim

EvolveMem treats retrieval infrastructure as evolvable rather than fixed: the LLM diagnosis module identifies root causes from per-question failure logs and proposes targeted configuration adjustments, while a guarded meta-analyzer enforces revert-on-regression and explore-on-stagnation rules. This self-evolution process autonomously discovers effective retrieval strategies, including new dimensions, and produces configurations that transfer positively across benchmarks instead of overfitting to any single task.
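The guarded loop named in the core claim can be sketched as a small control routine. Everything below is an editorial reconstruction, not the paper's code: `evaluate`, `diagnose`, `propose`, and `explore` are hypothetical stand-ins for the benchmark scorer, the LLM diagnosis module, the proposal step, and the stagnation fallback.

```python
def auto_research(initial, evaluate, diagnose, propose, explore,
                  rounds=8, patience=2):
    """Guarded self-evolution loop (editorial sketch of the described behavior).

    evaluate(config) -> score; diagnose/propose stand in for the LLM modules
    that read per-question failure logs and emit configuration edits.
    """
    config = best_config = initial
    best_score = evaluate(initial)
    stagnant = 0
    for _ in range(rounds):
        candidate = propose(config, diagnose(config))   # targeted edit from failure logs
        score = evaluate(candidate)
        if score > best_score:                          # accepted: ratchet into the action space
            config = best_config = candidate
            best_score, stagnant = score, 0
        else:                                           # revert-on-regression: candidate discarded
            stagnant += 1
        if stagnant >= patience:                        # explore-on-stagnation safeguard
            config, stagnant = explore(best_config), 0
    return best_config, best_score
```

With toy stand-ins (e.g., a config dict whose score saturates), the loop ratchets upward and then falls back to exploration once proposals stop helping, matching the accept/revert/explore behavior the paper describes.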

What carries the argument

The LLM-powered diagnosis module that reads failure logs, names root causes of retrieval errors, and proposes concrete configuration changes inside a safeguarded closed loop.

If this is right

  • Memory retrieval strategies improve without manual retuning for each new task or domain.
  • Evolved configurations transfer positively rather than catastrophically to different benchmarks.
  • The system can discover entirely new configuration dimensions not present in the initial action space.
  • Autonomous research cycles replace repeated human-driven configuration search for agent memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same diagnosis-plus-adjustment loop could be applied to other fixed components inside LLM agents such as planning or tool selection.
  • Post-deployment agents might keep running lightweight evolution cycles on new user interactions without external oversight.
  • The approach suggests a path toward memory systems whose retrieval rules continue to specialize over months of real use rather than remaining static after release.

Load-bearing premise

The LLM diagnosis module must correctly identify why retrieval failed and suggest fixes that the safeguards can validate without introducing undetected regressions or biases.

What would settle it

Apply an evolved configuration to a fresh benchmark outside the training loop; observing no gain, or negative transfer relative to the original fixed baseline, would falsify the claim that the process captures universal retrieval principles.

Figures

Figures reproduced from arXiv: 2605.13941 by Cihang Xie, Huaxiu Yao, Jiaqi Liu, Mingyu Ding, Peng Xia, Xinyu Ye, Zeyu Zheng.

Figure 1
Figure 1: EVOLVEMEM self-evolves its retrieval configuration on LoCoMo via AutoResearch. (a) A four-step evolution loop (EVALUATE–DIAGNOSE–PROPOSE–GUARD) ratchets accepted proposals into the action space; harmful ones (e.g., R2) are auto-reverted. (b) Overall F1 trajectory (single-backbone GPT-4o): 30.5% baseline to 54.3% at R7. SimpleMem [13] compresses conversations into retrieval-friendly units. Another line focu…
Figure 2
Figure 2: EVOLVEMEM architecture. Three layers connected by a self-evolution feedback loop. A typed memory store is populated by an LLM-based extractor with retry and chunk-splitting; a multi-view retriever fuses BM25, semantic, and structured-metadata search with optional entity-swap, query decomposition, and answer verification; an LLM-powered diagnosis module reads per-question raw-result logs and proposes struct…
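The multi-view fusion in Figure 2 can be grounded with the standard reciprocal rank fusion formula. This is an editorial illustration assuming the conventional k = 60 constant, not the paper's implementation; the input rankings stand in for the BM25, semantic, and structured-metadata views named in the caption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory ids via standard RRF.

    Each item's fused score sums 1 / (k + rank) over the views that
    returned it, so agreement across views outweighs any single view.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, an id ranked second by BM25 but first by the semantic and metadata views outranks one that a single view placed first, which is the point of fusing complementary retrievers.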
read the original abstract

Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces EvolveMem, a self-evolving memory architecture for LLM agents in which an LLM-powered diagnosis module analyzes per-question failure logs to propose targeted edits to the full retrieval configuration (scoring, fusion, and generation policies). A guarded meta-analyzer applies changes under revert-on-regression and explore-on-stagnation safeguards, realizing an AutoResearch loop that starts from a minimal baseline and autonomously discovers new configuration dimensions. On LoCoMo the system reports a 25.7% relative gain over the strongest baseline (78.0% over minimal); on MemBench the gain is 18.9% relative. Evolved configurations exhibit positive rather than negative transfer across the two benchmarks, which the authors interpret as evidence that the process captures universal retrieval principles. Code is released.

Significance. If the central claims hold, the work offers a concrete demonstration of closed-loop, LLM-driven architecture search for agent memory systems, reducing reliance on manual hyper-parameter tuning and providing empirical support for positive cross-benchmark generalization. The release of code and the explicit reporting of transfer results are concrete strengths that would allow the community to verify and extend the AutoResearch paradigm.

major comments (3)
  1. [Experiments] Experiments section: the headline performance deltas (25.7% on LoCoMo, 18.9% on MemBench) are presented without ablations that isolate the contribution of the LLM diagnosis module from the effects of additional search budget or the guarded meta-analyzer alone; without such controls the causal link between diagnosis quality and observed gains remains unverified.
  2. [Evaluation] Evaluation protocol: no statistical significance tests, number of independent evolution runs, or variance across runs are reported, and the abstract provides no details on data splits or controls for post-hoc selection of evolved configurations, undermining confidence that the reported improvements are robust.
  3. [Method] Diagnosis module description: the reliability of root-cause identification and change proposals is load-bearing for the AutoResearch claim, yet no inter-annotator agreement, human validation of proposals, or consistency metrics across LLM calls are supplied; this leaves open the possibility that gains arise from noisy search rather than intelligent diagnosis.
minor comments (2)
  1. [Method] The abstract states that the system discovers 'entirely new configuration dimensions not present in the original action space,' but the main text should include concrete examples of these dimensions and how they were represented in the structured action space.
  2. [Method] Notation for the guarded meta-analyzer and its revert/explore rules could be formalized (e.g., as a small state machine or pseudocode) to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with specific plans for revision. These changes will strengthen the causal evidence for the LLM diagnosis module, improve statistical reporting, and add validation for the diagnosis process while preserving the core AutoResearch contribution.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline performance deltas (25.7% on LoCoMo, 18.9% on MemBench) are presented without ablations that isolate the contribution of the LLM diagnosis module from the effects of additional search budget or the guarded meta-analyzer alone; without such controls the causal link between diagnosis quality and observed gains remains unverified.

    Authors: We agree that isolating the LLM diagnosis module's contribution is essential. In the revised manuscript we will add a dedicated ablation subsection reporting three controlled variants run with identical search budgets: (1) full EvolveMem, (2) guarded meta-analyzer with random proposals instead of LLM diagnosis, and (3) random search over the same configuration space without the meta-analyzer. Results will be presented in a new table with relative gains, allowing direct assessment of the diagnosis module's incremental value. We will also note the additional compute required for these controls. revision: yes

  2. Referee: [Evaluation] Evaluation protocol: no statistical significance tests, number of independent evolution runs, or variance across runs are reported, and the abstract provides no details on data splits or controls for post-hoc selection of evolved configurations, undermining confidence that the reported improvements are robust.

    Authors: We acknowledge the need for greater statistical rigor. The revision will report results from five independent evolution runs per benchmark, including mean accuracy, standard deviation, and paired t-test p-values against baselines. The abstract will be updated to state that standard benchmark splits were used and that final configurations were chosen on a held-out validation portion of each dataset to avoid post-hoc selection. These details and the run count will be added to the Evaluation and Experiments sections. revision: yes

  3. Referee: [Method] Diagnosis module description: the reliability of root-cause identification and change proposals is load-bearing for the AutoResearch claim, yet no inter-annotator agreement, human validation of proposals, or consistency metrics across LLM calls are supplied; this leaves open the possibility that gains arise from noisy search rather than intelligent diagnosis.

    Authors: We agree that empirical validation of diagnosis quality would strengthen the claim. We will add a new subsection with human evaluation of 100 randomly sampled diagnosis outputs (two independent annotators, Cohen's kappa reported) and consistency measurements obtained by re-running the diagnosis module (temperature 0) on the same failure logs and measuring proposal overlap. These results will be presented alongside the main experiments to demonstrate that the observed gains arise from structured rather than noisy proposals. revision: yes
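The validation metrics the rebuttal commits to (paired t-tests over runs, proposal overlap under re-runs, Cohen's kappa between annotators) are all standard. A minimal self-contained sketch, with function and variable names chosen for illustration rather than taken from the paper:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """t statistic for paired runs (df = n - 1); a p-value needs a t table."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def proposal_overlap(props_a, props_b):
    """Jaccard overlap between two sets of proposed configuration edits."""
    a, b = set(props_a), set(props_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(labels_a)
    po = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    cats = set(labels_a) | set(labels_b)
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

High proposal overlap across temperature-0 re-runs would support structured diagnosis over noisy search, which is exactly the distinction the referee's third comment turns on.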

Circularity Check

0 steps flagged

No significant circularity: empirical gains rest on external benchmark measurements

full rationale

The paper describes an empirical architecture in which an LLM diagnosis module proposes configuration edits that are then evaluated on held-out benchmarks (LoCoMo, MemBench). Reported improvements (25.7% and 18.9% relative) and positive cross-benchmark transfer are measured outcomes, not quantities derived by construction from the same inputs. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text that would collapse the central claims back to the system’s own outputs. The derivation chain is therefore self-contained as an experimental result rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the unverified capability of the LLM to perform accurate failure diagnosis and on the existence of a well-behaved action space for retrieval configurations; no explicit free parameters are named in the abstract.

axioms (1)
  • domain assumption An LLM can reliably diagnose root causes of memory retrieval failures from per-question logs
    This is the core mechanism enabling the evolution loop and is assumed rather than proven in the abstract.
invented entities (1)
  • Guarded meta-analyzer no independent evidence
    purpose: Applies configuration changes with automatic revert-on-regression and explore-on-stagnation safeguards
    New component introduced to stabilize the self-evolution process

pith-pipeline@v0.9.0 · 5590 in / 1233 out tokens · 40389 ms · 2026-05-15T04:45:05.843429+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 11 internal anchors

  1. [1]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InInternational Conference on Learning Representations (ICLR), 2024

  2. [2]

Self-play fine-tuning converts weak language models to strong language models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  3. [3]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  4. [4]

    Über das gedächtnis: Untersuchungen zur experimentellen psychologie

    Hermann Ebbinghaus. Über das gedächtnis: Untersuchungen zur experimentellen psychologie. 1885

  5. [5]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On the path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

  6. [6]

    λ-tune: Harnessing large language models for automated database system tuning

    Victor Giannakouris and Immanuel Trummer. λ-tune: Harnessing large language models for automated database system tuning. InProceedings of the ACM on Management of Data (SIGMOD), 2025

  7. [7]

    Realm: Retrieval- augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Realm: Retrieval- augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning (ICML), pages 3929–3938, 2020

  8. [8]

    Memory in the Age of AI Agents

    Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025

  9. [9]

Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  10. [10]

Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023

  11. [11]

Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  12. [12]

Omni-SimpleMem: AutoResearch-guided discovery of lifelong multimodal agent memory

    Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, and Huaxiu Yao. Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory. arXiv preprint arXiv:2604.01007, 2026

  13. [13]

SimpleMem: Efficient lifelong memory for LLM agents

    Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents. arXiv preprint arXiv:2601.02553, 2026

  14. [14]

AutoResearchClaw: Fully autonomous research from idea to paper

    Jiaqi Liu, Peng Xia, Siwei Han, Shi Qiu, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Jiawei Zhou, Hongtu Zhu, Yun Li, Jiaheng Zhang, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Autoresearchclaw: Fully autonomous research from idea to paper, 2026. https://github.com/aiming-lab/AutoResearchClaw

  15. [15]

Memverse: Multimodal memory for lifelong learning agents

    Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents. arXiv preprint arXiv:2512.03627, 2025

  16. [16]

Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  17. [17]

    Evaluating very long-term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  18. [18]

Why there are complementary learning systems in the hippocampus and neocortex

    James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3):419–457, 1995

  19. [19]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems.arXiv preprint arXiv:2310.08560, 2023

  20. [20]

SeCom: On memory construction and retrieval for personalized conversational agents

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. SeCom: On memory construction and retrieval for personalized conversational agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  21. [21]

Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023

  22. [22]

Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  23. [23]

Cognitive architectures for language agents

    Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research (TMLR), 2024

  24. [24]

    MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

  25. [25]

    In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents

    Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Rajan Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, and Tomas Pfister. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. InProceedings of the 63rd Annual Meeting of the Association for C...

  26. [26]

    Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system

    Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. SCM: Enhancing large language model with self-controlled memory framework.arXiv preprint arXiv:2304.13343, 2023

  27. [27]

Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024

  28. [28]

    A new paradigm in tuning learned indexes: A reinforcement learning enhanced approach

    Taiyi Wang, Liang Liang, Guang Yang, Thomas Heinis, and Eiko Yoneki. A new paradigm in tuning learned indexes: A reinforcement learning enhanced approach. InProceedings of the ACM on Management of Data (SIGMOD), 2025

  29. [29]

    Augmenting language models with long-term memory

Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  30. [30]

    MEMORYLLM: Towards self-updatable large language models

    Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian McAuley. MEMORYLLM: Towards self-updatable large language models. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

  31. [31]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, et al. Evo-memory: Benchmarking LLM agent test-time learning with self-evolving memory.arXiv preprint arXiv:2511.20857, 2025

  32. [32]

    EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

    Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025

  33. [33]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

  34. [34]

    A-mem: Agentic memory for llm agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  35. [35]

    Corrective Retrieval Augmented Generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

  36. [36]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828, 2025

  37. [37]

    Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

    Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

  38. [38]

MemEvolve: Meta-evolution of agent memory systems

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

  39. [39]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents.arXiv preprint arXiv:2602.02474, 2026

  40. [40]

Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, et al. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026

  41. [41]

    A survey on the memory mechanism of large language model based agents

    Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents. ACM Transactions on Information Systems, 43(6), 2025

  42. [42]

    Expel: Llm agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

  43. [43]

MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024

  44. [44]

    Perseid meteor shower, painting descriptions) can be recalled

    Enable semantic retrieval with fusion_mode=’rrf’ and semantic_top_k in the low- mid range (12–16) so lexically-different but semantically-related memories (e.g., camping trip vs. Perseid meteor shower, painting descriptions) can be recalled

  45. [45]

    Increase retrieval depth and context breadth (keyword_top_k to ∼10–12, max_context to ∼12–16)especially for categories 4 and 5 via per_category_overridesto fix the many abstentions and ‘not specified’ failures for detailed episodic facts

  46. [46]

    watched the Perseid meteor shower while camping

    Tighten and enrich extraction so specific concrete details are captured and retrievable. Notice that the diagnosis LLM namesthe exact failure modeof this case (camping trip vs. Per- seid meteor shower) inside priority action 1—a failure pattern it inferred from L0 alone, with no benchmark-specific cue. 15 Table 8: Per-round trace for the case-study probe ...

  47. [47]

    Complete Coverage: Generate entries for ALL facts, events, opinions, plans, feelings

  48. [48]

    Use actual names and absolute dates

    Force Disambiguation: PROHIBIT pronouns (he/she/it/they). Use actual names and absolute dates

  49. [49]

    Lossless Restatement: Each entry must be complete, independent, self-contained

  50. [50]

    twice",

    Extract EVERY specific detail – no paraphrasing fine-grained facts: - Named entities: book/movie/song/game titles (keep quotation marks), brand names, places, pet names, nicknames, colors, specific activities, specific numbers. - Quantities: exact counts, frequencies ("twice", "three times"), durations ("for 3 years", "since 2019"). - Lists: if someone me...

  51. [51]

    Cover names, places, objects, opinions, plans, feelings, events, dates, gifts, hobbies, relationships, pets, travel, food, books, art, music, work, family, health. [Output Format] Return a JSON array: [ { "lossless_restatement": "Complete sentence with all subjects, objects, time, location", "keywords": ["keyword1", "keyword2"], "timestamp": "YYYY-MM-DD o...
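The [Output Format] above specifies a JSON array of entries with lossless_restatement, keywords, and timestamp fields. A minimal loader enforcing it might look like the sketch below; the helper name and the bare-year timestamp fallback are assumptions, since the excerpt truncates the format string.

```python
import json
import re

REQUIRED = ("lossless_restatement", "keywords", "timestamp")
# Accept "YYYY-MM-DD" or a bare "YYYY"; the full format spec is truncated
# in the excerpt, so the year-only fallback is an assumption.
TS_RE = re.compile(r"^\d{4}(-\d{2}-\d{2})?$")

def parse_memory_entries(raw: str):
    """Parse the extractor's JSON array, dropping malformed entries."""
    valid = []
    for e in json.loads(raw):
        if all(k in e for k in REQUIRED) and TS_RE.match(e["timestamp"]) and e["keywords"]:
            valid.append(e)
    return valid
```

Dropping malformed entries at ingest keeps later retrieval stages from ever seeing memories without keywords or a usable timestamp.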

  52. [52]

    ALWAYS provide a substantive answer, never ’not specified’

  53. [53]

Answer in 1-5 words using exact facts from context. Return JSON: {"reasoning":"brief","answer":"concise"}

LoCoMo Cat. 3 Inferential (base): User Prompt
Question: {question} Context: {context}
This question asks for an INFERENCE or COUNTERFACTUAL judgement (e.g., ’Would X...’, ’What would X likely...’). Your job is to synthesize a best-guess answer from th...

  54. [54]

    NEVER answer ’unknown’ / ’not specified’ / ’not mentioned’. The answer must always be a substantive judgement

  55. [55]

Answer in 1-6 words. Preferred forms:
- ’Would X...’ -> ’Likely yes’ / ’Likely no’ / ’Yes’ / ’No’ (+ a short reason ONLY if very informative).
- ’What/Which would X...’ -> name the most likely option

  56. [56]

Choose the option most consistent with the user’s stated preferences, history, and values in the context. Return JSON: {"reasoning":"brief","answer":"concise"}

LoCoMo Cat. 3 Nuanced-Inferential (discovered, 6 subtypes): User Prompt
Gated by locomo_cat3_inferential_nuanced. A regex classifier routes each Cat. 3 question to one of six subtypes; each rece...

  57. [57]

    CRITICAL: NEVER answer ’Unknown’, ’Not specified’, ’Not mentioned’, empty string, or any refusal. Make the MOST PLAUSIBLE guess grounded in the speaker’s stated preferences

  58. [58]

    Prefer exact phrases from context over paraphrased abstract nouns (’Nintendo Switch’, not ’a console’)

  59. [59]

    For counterfactuals (’would X...’), pick the option most consistent with the speaker’s stated preferences

  60. [60]

Geography: if the question asks for a STATE / COUNTRY and context only has a CITY / LANDMARK, infer the enclosing jurisdiction. Return JSON: {"reasoning":"brief","answer":"<answer>"}

Subtype-specific instructions (abbreviated):
• Counting: enumerate matching events, return spelled-out number (<5) or Arabic digit.
• Location-hierarchy: copy named jurisdiction...
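The paper describes a regex classifier that routes each Cat. 3 question to one of six subtype-specific prompts. A sketch of such a router follows; the patterns, and any subtype names beyond the quoted counting/geography/counterfactual ones, are illustrative assumptions.

```python
import re

# Only subtypes named in the excerpt are included; the real system has six.
SUBTYPE_PATTERNS = [
    ("counting", re.compile(r"\bhow (many|often)\b", re.I)),
    ("geography", re.compile(r"\b(which|what) (state|country)\b", re.I)),
    ("counterfactual", re.compile(r"\bwould \w+\b", re.I)),
]

def route_cat3(question: str) -> str:
    """Return the first subtype whose pattern fires, else a default."""
    for name, pattern in SUBTYPE_PATTERNS:
        if pattern.search(question):
            return name
    return "default"
```

First-match-wins ordering matters here: a question like "How many states would she visit?" is routed to counting, so more specific patterns belong earlier in the list.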

  61. [61]

    Answer in 1-10 words. Use EXACT words/phrases from context

  62. [62]

Format conventions:
- ’how many/times’ -> single Arabic numeral (’2’, not ’two’).
- ’when’ / year questions -> 4-digit year (’2019’) or ’YYYY-MM-DD’ if date is known; never ’N years ago’.
- ’where’ -> place name exactly as in context.
- ’what/who’ -> shortest distinctive noun phrase in context
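The format conventions above could also be enforced after generation. The post-processor below is a sketch under stated assumptions: the function name, word-to-numeral table, and question heuristics are illustrative, not part of the paper's pipeline.

```python
import re

WORD_NUMS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize_answer(question: str, answer: str) -> str:
    """Apply the format conventions: Arabic numerals for counts, a 4-digit
    year (or full date) for 'when' questions, otherwise pass through."""
    q = question.lower()
    a = answer.strip()
    if "how many" in q:
        return WORD_NUMS.get(a.lower(), a)  # 'two' -> '2'
    if q.startswith("when") or "what year" in q:
        m = re.search(r"\b(\d{4}-\d{2}-\d{2}|\d{4})\b", a)
        if m:
            return m.group(1)  # strip everything but the date/year
    return a
```

Such a normalizer guards against exact-match scoring penalizing a correct answer that arrives in the wrong surface form.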

  63. [63]

    NEVER answer ’Unknown’, ’Not specified’, ’Not mentioned’. Even if context is indirect, pick the single most plausible answer from what IS mentioned

  64. [64]

For multi-item questions (e.g. ’which movies’), list each item separated by a comma. Return JSON: {"reasoning":"brief","answer":"concise"}

F.4 Answer Generation: MemBench (MCQ)

MemBench is multiple-choice. MemBench: System Prompt You are a memory-grounded multiple-choice question answerer. You MUST pick exactly ONE letter (A/B/C/D). JSON only. MemBench...

  65. [65]

    Pick EXACTLY one letter from {A,B,C,D}

  66. [66]

    Base your answer on the context; if context is incomplete still pick the most plausible option

  67. [67]

Return JSON: {"reasoning":"brief","answer":"X"} where X is a single letter.

F.5 Answer Verification (Second Pass)

Invoked when enable_answer_verification is set (Eq. 13). Verifier: System Prompt Answer verifier. JSON output only. Verifier Strict (default): User Prompt Question: {question} Context: {context} Candidate answer: {candidate} Review the candidate...
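The two-pass flow gated by enable_answer_verification might be wired as follows. This is a minimal sketch: the llm stand-in, the abbreviated prompts, and the corrected_answer field are assumptions, not the paper's actual verifier schema.

```python
import json

def answer_with_verification(llm, question, context, config):
    """First pass generates a candidate; when enable_answer_verification
    is set, a second strict pass reviews it and may replace it.
    `llm` is a stand-in callable (prompt -> JSON string)."""
    candidate = json.loads(
        llm(f"Question: {question}\nContext: {context}\nAnswer as JSON.")
    )
    if not config.get("enable_answer_verification"):
        return candidate["answer"]
    verdict = json.loads(llm(
        f"Question: {question}\nContext: {context}\n"
        f"Candidate answer: {candidate['answer']}\nReview the candidate. Return JSON."
    ))
    # Keep the candidate unless the verifier supplies a correction.
    return verdict.get("corrected_answer", candidate["answer"])
```

Because the flag lives in the retrieval configuration, the diagnosis loop can switch this second pass on only when residual ’Unknown’ or format-mismatch failures justify the extra LLM call.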

  68. [68]

    If many ’abstention’ failures -> raise top_k, widen max_context, consider rrf fusion

  69. [69]

    If many ’wrong answer’ failures with high retrieval -> lower max_context or raise weights for strongest view

  70. [70]

    If temporal category weakness -> enable time_decay_half_life_days

  71. [71]

    If adversarial category weakness -> enable_entity_swap=true

  72. [72]

    If multi-hop weakness -> reflection_rounds >= 1

  73. [73]

    If ONE category lags -> per_category_overrides (preserve gains elsewhere)

  74. [74]

    Prefer enabling something disabled BEFORE tuning a small int

  75. [75]

    If residual ’Unknown’ or format-mismatch -> enable_answer_verification=true
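The heuristics in [68]–[75] amount to a rule table from diagnosed failure patterns to RetrievalConfig changes. A sketch of that mapping follows; the failure-count keys, increments, and threshold are illustrative assumptions (in the paper, the diagnosis LLM applies these rules from the failure logs rather than hard-coded counters).

```python
def suggest_config_changes(failure_counts, config, threshold=5):
    """Map failure patterns to RetrievalConfig adjustments, following the
    diagnosis prompt's heuristics. Keys and threshold are assumptions."""
    s = {}
    if failure_counts.get("abstention", 0) > threshold:
        s["keyword_top_k"] = config.get("keyword_top_k", 5) + 4  # raise top_k
        s["max_context"] = config.get("max_context", 8) + 4      # widen context
        s["fusion_mode"] = "rrf"
    if failure_counts.get("temporal", 0) > threshold:
        s["time_decay_half_life_days"] = 30
    if failure_counts.get("adversarial", 0) > threshold:
        s["enable_entity_swap"] = True
    if failure_counts.get("multi_hop", 0) > threshold:
        s["reflection_rounds"] = max(1, config.get("reflection_rounds", 0))
    if failure_counts.get("unknown_or_format", 0) > threshold:
        s["enable_answer_verification"] = True
    return s
```

Only the triggered fields appear in the output, matching the prompt's instruction to include only fields being changed.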

  76. [76]

LoCoMo prompt-surface flags are highest-ROI when their symptom matches; propose them early.

## Output
Return JSON with `parameter_suggestions` as a flat dict of field -> new value. Fields MUST match RetrievalConfig field names exactly. Only include fields you want to change. {{ "root_causes": {{"extraction_gap": {{...}}, "retrieval_miss": {{...}}, "answer...
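Downstream of this output, the meta-analyzer applies the proposed parameter_suggestions only when safeguards confirm no regression, reverting otherwise. A minimal sketch of that guard, with the function name, evaluate stand-in, and epsilon tolerance as assumptions:

```python
def apply_with_guard(current_config, suggestions, evaluate, epsilon=0.0):
    """Meta-analyzer safeguard: accept the diagnosis module's
    parameter_suggestions only if the evaluation score does not regress;
    otherwise revert. `evaluate` is a stand-in (config -> score)."""
    baseline = evaluate(current_config)
    trial = {**current_config, **suggestions}
    if evaluate(trial) >= baseline - epsilon:
        return trial           # accept the evolved configuration
    return current_config      # revert on regression
```

This revert-on-regression check is what lets the AutoResearch loop explore aggressive proposals while guaranteeing the deployed configuration never gets worse than the current baseline on the evaluation set.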