pith. machine review for the scientific record.

arxiv: 2605.13941 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords self-evolving memory · LLM agents · long-term memory · retrieval configuration · failure diagnosis · AutoResearch · memory architecture · autonomous optimization

The pith

LLM agents can improve long-term memory by letting an LLM module diagnose retrieval failures and autonomously adjust the system's own configuration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that memory for LLM agents needs to evolve at two levels at once: the stored knowledge and the retrieval rules that query it. Existing systems keep scoring functions, fusion methods, and generation policies frozen after deployment, so the authors expose the entire retrieval configuration as an adjustable action space. An LLM diagnosis module reads detailed failure logs after each question, names the root causes, and proposes specific changes; a meta-analyzer applies them only when safeguards confirm no regression. The resulting closed loop runs repeated AutoResearch cycles that start from a minimal baseline and converge on stronger strategies, some of which introduce configuration dimensions absent from the original space.

Core claim

EvolveMem treats retrieval infrastructure as evolvable rather than fixed: the LLM diagnosis module identifies root causes from per-question failure logs and proposes targeted configuration adjustments, while a guarded meta-analyzer enforces revert-on-regression and explore-on-stagnation rules. This self-evolution process autonomously discovers effective retrieval strategies, including new dimensions, and produces configurations that transfer positively across benchmarks instead of overfitting to any single task.
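The guarded loop named in the core claim can be sketched as a small control routine. Everything below is an editorial reconstruction, not the paper's code: `evaluate`, `diagnose`, `propose`, and `explore` are hypothetical stand-ins for the benchmark scorer, the LLM diagnosis module, the proposal step, and the stagnation fallback.

```python
def auto_research(initial, evaluate, diagnose, propose, explore,
                  rounds=8, patience=2):
    """Guarded self-evolution loop (editorial sketch of the described behavior).

    evaluate(config) -> score; diagnose/propose stand in for the LLM modules
    that read per-question failure logs and emit configuration edits.
    """
    config = best_config = initial
    best_score = evaluate(initial)
    stagnant = 0
    for _ in range(rounds):
        candidate = propose(config, diagnose(config))   # targeted edit from failure logs
        score = evaluate(candidate)
        if score > best_score:                          # accepted: ratchet into the action space
            config = best_config = candidate
            best_score, stagnant = score, 0
        else:                                           # revert-on-regression: candidate discarded
            stagnant += 1
        if stagnant >= patience:                        # explore-on-stagnation safeguard
            config, stagnant = explore(best_config), 0
    return best_config, best_score
```

With toy stand-ins (e.g., a config dict whose score saturates), the loop ratchets upward and then falls back to exploration once proposals stop helping, matching the accept/revert/explore behavior the paper describes.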

What carries the argument

The LLM-powered diagnosis module that reads failure logs, names root causes of retrieval errors, and proposes concrete configuration changes inside a safeguarded closed loop.

If this is right

  • Memory retrieval strategies improve without manual retuning for each new task or domain.
  • Evolved configurations transfer positively rather than catastrophically to different benchmarks.
  • The system can discover entirely new configuration dimensions not present in the initial action space.
  • Autonomous research cycles replace repeated human-driven configuration search for agent memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same diagnosis-plus-adjustment loop could be applied to other fixed components inside LLM agents such as planning or tool selection.
  • Post-deployment agents might keep running lightweight evolution cycles on new user interactions without external oversight.
  • The approach suggests a path toward memory systems whose retrieval rules continue to specialize over months of real use rather than remaining static after release.

Load-bearing premise

The LLM diagnosis module must correctly identify why retrieval failed and suggest fixes that the safeguards can validate without introducing undetected regressions or biases.

What would settle it

Apply an evolved configuration to a fresh benchmark outside the training loop; observing no gain, or negative transfer relative to the original fixed baseline, would falsify the claim that the process captures universal retrieval principles.

Figures

Figures reproduced from arXiv: 2605.13941 by Cihang Xie, Huaxiu Yao, Jiaqi Liu, Mingyu Ding, Peng Xia, Xinyu Ye, Zeyu Zheng.

Figure 1
Figure 1: EVOLVEMEM self-evolves its retrieval configuration on LoCoMo via AutoResearch. (a) A four-step evolution loop (EVALUATE–DIAGNOSE–PROPOSE–GUARD) ratchets accepted proposals into the action space; harmful ones (e.g., R2) are auto-reverted. (b) Overall F1 trajectory (single-backbone GPT-4o): 30.5% baseline to 54.3% at R7. SimpleMem [13] compresses conversations into retrieval-friendly units. Another line focu…
Figure 2
Figure 2: EVOLVEMEM architecture. Three layers connected by a self-evolution feedback loop. A typed memory store is populated by an LLM-based extractor with retry and chunk-splitting; a multi-view retriever fuses BM25, semantic, and structured-metadata search with optional entity-swap, query decomposition, and answer verification; an LLM-powered diagnosis module reads per-question raw-result logs and proposes struct…
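The multi-view fusion in Figure 2 can be grounded with the standard reciprocal rank fusion formula. This is an editorial illustration assuming the conventional k = 60 constant, not the paper's implementation; the input rankings stand in for the BM25, semantic, and structured-metadata views named in the caption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory ids via standard RRF.

    Each item's fused score sums 1 / (k + rank) over the views that
    returned it, so agreement across views outweighs any single view.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, an id ranked second by BM25 but first by the semantic and metadata views outranks one that a single view placed first, which is the point of fusing complementary retrievers.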
read the original abstract

Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces EvolveMem, a self-evolving memory architecture for LLM agents in which an LLM-powered diagnosis module analyzes per-question failure logs to propose targeted edits to the full retrieval configuration (scoring, fusion, and generation policies). A guarded meta-analyzer applies changes under revert-on-regression and explore-on-stagnation safeguards, realizing an AutoResearch loop that starts from a minimal baseline and autonomously discovers new configuration dimensions. On LoCoMo the system reports a 25.7% relative gain over the strongest baseline (78.0% over minimal); on MemBench the gain is 18.9% relative. Evolved configurations exhibit positive rather than negative transfer across the two benchmarks, which the authors interpret as evidence that the process captures universal retrieval principles. Code is released.

Significance. If the central claims hold, the work offers a concrete demonstration of closed-loop, LLM-driven architecture search for agent memory systems, reducing reliance on manual hyper-parameter tuning and providing empirical support for positive cross-benchmark generalization. The release of code and the explicit reporting of transfer results are concrete strengths that would allow the community to verify and extend the AutoResearch paradigm.

major comments (3)
  1. [Experiments] Experiments section: the headline performance deltas (25.7% on LoCoMo, 18.9% on MemBench) are presented without ablations that isolate the contribution of the LLM diagnosis module from the effects of additional search budget or the guarded meta-analyzer alone; without such controls the causal link between diagnosis quality and observed gains remains unverified.
  2. [Evaluation] Evaluation protocol: no statistical significance tests, number of independent evolution runs, or variance across runs are reported, and the abstract provides no details on data splits or controls for post-hoc selection of evolved configurations, undermining confidence that the reported improvements are robust.
  3. [Method] Diagnosis module description: the reliability of root-cause identification and change proposals is load-bearing for the AutoResearch claim, yet no inter-annotator agreement, human validation of proposals, or consistency metrics across LLM calls are supplied; this leaves open the possibility that gains arise from noisy search rather than intelligent diagnosis.
minor comments (2)
  1. [Method] The abstract states that the system discovers 'entirely new configuration dimensions not present in the original action space,' but the main text should include concrete examples of these dimensions and how they were represented in the structured action space.
  2. [Method] Notation for the guarded meta-analyzer and its revert/explore rules could be formalized (e.g., as a small state machine or pseudocode) to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with specific plans for revision. These changes will strengthen the causal evidence for the LLM diagnosis module, improve statistical reporting, and add validation for the diagnosis process while preserving the core AutoResearch contribution.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline performance deltas (25.7% on LoCoMo, 18.9% on MemBench) are presented without ablations that isolate the contribution of the LLM diagnosis module from the effects of additional search budget or the guarded meta-analyzer alone; without such controls the causal link between diagnosis quality and observed gains remains unverified.

    Authors: We agree that isolating the LLM diagnosis module's contribution is essential. In the revised manuscript we will add a dedicated ablation subsection reporting three controlled variants run with identical search budgets: (1) full EvolveMem, (2) guarded meta-analyzer with random proposals instead of LLM diagnosis, and (3) random search over the same configuration space without the meta-analyzer. Results will be presented in a new table with relative gains, allowing direct assessment of the diagnosis module's incremental value. We will also note the additional compute required for these controls. revision: yes

  2. Referee: [Evaluation] Evaluation protocol: no statistical significance tests, number of independent evolution runs, or variance across runs are reported, and the abstract provides no details on data splits or controls for post-hoc selection of evolved configurations, undermining confidence that the reported improvements are robust.

    Authors: We acknowledge the need for greater statistical rigor. The revision will report results from five independent evolution runs per benchmark, including mean accuracy, standard deviation, and paired t-test p-values against baselines. The abstract will be updated to state that standard benchmark splits were used and that final configurations were chosen on a held-out validation portion of each dataset to avoid post-hoc selection. These details and the run count will be added to the Evaluation and Experiments sections. revision: yes

  3. Referee: [Method] Diagnosis module description: the reliability of root-cause identification and change proposals is load-bearing for the AutoResearch claim, yet no inter-annotator agreement, human validation of proposals, or consistency metrics across LLM calls are supplied; this leaves open the possibility that gains arise from noisy search rather than intelligent diagnosis.

    Authors: We agree that empirical validation of diagnosis quality would strengthen the claim. We will add a new subsection with human evaluation of 100 randomly sampled diagnosis outputs (two independent annotators, Cohen's kappa reported) and consistency measurements obtained by re-running the diagnosis module (temperature 0) on the same failure logs and measuring proposal overlap. These results will be presented alongside the main experiments to demonstrate that the observed gains arise from structured rather than noisy proposals. revision: yes
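The validation metrics the rebuttal commits to (paired t-tests over runs, proposal overlap under re-runs, Cohen's kappa between annotators) are all standard. A minimal self-contained sketch, with function and variable names chosen for illustration rather than taken from the paper:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """t statistic for paired runs (df = n - 1); a p-value needs a t table."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def proposal_overlap(props_a, props_b):
    """Jaccard overlap between two sets of proposed configuration edits."""
    a, b = set(props_a), set(props_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(labels_a)
    po = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    cats = set(labels_a) | set(labels_b)
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

High proposal overlap across temperature-0 re-runs would support structured diagnosis over noisy search, which is exactly the distinction the referee's third comment turns on.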

Circularity Check

0 steps flagged

No significant circularity: empirical gains rest on external benchmark measurements

full rationale

The paper describes an empirical architecture in which an LLM diagnosis module proposes configuration edits that are then evaluated on held-out benchmarks (LoCoMo, MemBench). Reported improvements (25.7% and 18.9% relative) and positive cross-benchmark transfer are measured outcomes, not quantities derived by construction from the same inputs. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text that would collapse the central claims back to the system’s own outputs. The derivation chain is therefore self-contained as an experimental result rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the unverified capability of the LLM to perform accurate failure diagnosis and on the existence of a well-behaved action space for retrieval configurations; no explicit free parameters are named in the abstract.

axioms (1)
  • domain assumption An LLM can reliably diagnose root causes of memory retrieval failures from per-question logs
    This is the core mechanism enabling the evolution loop and is assumed rather than proven in the abstract.
invented entities (1)
  • Guarded meta-analyzer no independent evidence
    purpose: Applies configuration changes with automatic revert-on-regression and explore-on-stagnation safeguards
    New component introduced to stabilize the self-evolution process

pith-pipeline@v0.9.0 · 5590 in / 1233 out tokens · 40389 ms · 2026-05-15T04:45:05.843429+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 11 internal anchors

  1. [1]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. InInternational Conference on Learning Representations (ICLR), 2024

  2. [2]

Self-play fine-tuning converts weak language models to strong language models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  3. [3]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  4. [4]

    Über das gedächtnis: Untersuchungen zur experimentellen psychologie

    Hermann Ebbinghaus. Über das gedächtnis: Untersuchungen zur experimentellen psychologie. 1885

  5. [5]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On the path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

  6. [6]

    λ-tune: Harnessing large language models for automated database system tuning

    Victor Giannakouris and Immanuel Trummer. λ-tune: Harnessing large language models for automated database system tuning. InProceedings of the ACM on Management of Data (SIGMOD), 2025

  7. [7]

    Realm: Retrieval- augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Realm: Retrieval- augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning (ICML), pages 3929–3938, 2020

  8. [8]

    Memory in the Age of AI Agents

    Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents.arXiv preprint arXiv:2512.13564, 2025

  9. [9]

Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity

    Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C. Park. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

  10. [10]

Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023

  11. [11]

Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  12. [12]

Omni-SimpleMem: AutoResearch-guided discovery of lifelong multimodal agent memory

    Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, and Huaxiu Yao. Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory. arXiv preprint arXiv:2604.01007, 2026

  13. [13]

SimpleMem: Efficient lifelong memory for LLM agents

    Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents. arXiv preprint arXiv:2601.02553, 2026

  14. [14]

AutoResearchClaw: Fully autonomous research from idea to paper

    Jiaqi Liu, Peng Xia, Siwei Han, Shi Qiu, Letian Zhang, Guiming Chen, Haoqin Tu, Xinyu Yang, Jiawei Zhou, Hongtu Zhu, Yun Li, Jiaheng Zhang, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Autoresearchclaw: Fully autonomous research from idea to paper, 2026. https://github.com/aiming-lab/AutoResearchClaw

  15. [15]

Memverse: Multimodal memory for lifelong learning agents

    Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, et al. Memverse: Multimodal memory for lifelong learning agents. arXiv preprint arXiv:2512.03627, 2025

  16. [16]

Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  17. [17]

    Evaluating very long-term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  18. [18]

Why there are complementary learning systems in the hippocampus and neocortex

    James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3):419–457, 1995

  19. [19]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. Memgpt: Towards llms as operating systems.arXiv preprint arXiv:2310.08560, 2023

  20. [20]

SeCom: On memory construction and retrieval for personalized conversational agents

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. SeCom: On memory construction and retrieval for personalized conversational agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2025

  21. [21]

Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023

  22. [22]

Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  23. [23]

Cognitive architectures for language agents

    Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research (TMLR), 2024

  24. [24]

    MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

  25. [25]

    In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents

    Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Rajan Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, and Tomas Pfister. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. InProceedings of the 63rd Annual Meeting of the Association for C...

  26. [26]

    Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system

    Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. SCM: Enhancing large language model with self-controlled memory framework.arXiv preprint arXiv:2304.13343, 2023

  27. [27]

Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024

  28. [28]

    A new paradigm in tuning learned indexes: A reinforcement learning enhanced approach

    Taiyi Wang, Liang Liang, Guang Yang, Thomas Heinis, and Eiko Yoneki. A new paradigm in tuning learned indexes: A reinforcement learning enhanced approach. InProceedings of the ACM on Management of Data (SIGMOD), 2025

  29. [29]

    Augmenting language models with long-term memory

Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  30. [30]

    MEMORYLLM: Towards self-updatable large language models

    Yu Wang, Yifan Gao, Xiusi Chen, Haoming Jiang, Shiyang Li, Jingfeng Yang, Qingyu Yin, Zheng Li, Xian Li, Bing Yin, Jingbo Shang, and Julian McAuley. MEMORYLLM: Towards self-updatable large language models. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

  31. [31]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, et al. Evo-memory: Benchmarking LLM agent test-time learning with self-evolving memory.arXiv preprint arXiv:2511.20857, 2025

  32. [32]

    EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

    Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025

  33. [33]

    SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

    Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

  34. [34]

    A-mem: Agentic memory for llm agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  35. [35]

    Corrective Retrieval Augmented Generation

    Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation.arXiv preprint arXiv:2401.15884, 2024

  36. [36]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828, 2025

  37. [37]

    Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

    Yi Yu, Liuyi Yao, Yuexiang Xie, Qingquan Tan, Jiaqi Feng, Yaliang Li, and Libing Wu. Agentic memory: Learning unified long-term and short-term memory management for large language model agents.arXiv preprint arXiv:2601.01885, 2026

  38. [38]

MemEvolve: Meta-evolution of agent memory systems

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems. arXiv preprint arXiv:2512.18746, 2025

  39. [39]

    MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents.arXiv preprint arXiv:2602.02474, 2026

  40. [40]

Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, et al. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026

  41. [41]

    A survey on the memory mechanism of large language model based agents

    Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents. ACM Transactions on Information Systems, 43(6), 2025

  42. [42]

    Expel: Llm agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

  43. [43]

MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024

  44. [44]

    Perseid meteor shower, painting descriptions) can be recalled

    Enable semantic retrieval with fusion_mode=’rrf’ and semantic_top_k in the low- mid range (12–16) so lexically-different but semantically-related memories (e.g., camping trip vs. Perseid meteor shower, painting descriptions) can be recalled

  45. [45]

    Increase retrieval depth and context breadth (keyword_top_k to ∼10–12, max_context to ∼12–16)especially for categories 4 and 5 via per_category_overridesto fix the many abstentions and ‘not specified’ failures for detailed episodic facts

  46. [46]

    watched the Perseid meteor shower while camping

    Tighten and enrich extraction so specific concrete details are captured and retrievable. Notice that the diagnosis LLM namesthe exact failure modeof this case (camping trip vs. Per- seid meteor shower) inside priority action 1—a failure pattern it inferred from L0 alone, with no benchmark-specific cue. 15 Table 8: Per-round trace for the case-study probe ...

  47. [47]

    Complete Coverage: Generate entries for ALL facts, events, opinions, plans, feelings

  48. [48]

    Use actual names and absolute dates

    Force Disambiguation: PROHIBIT pronouns (he/she/it/they). Use actual names and absolute dates

  49. [49]

    Lossless Restatement: Each entry must be complete, independent, self-contained

  50. [50]

    twice",

    Extract EVERY specific detail – no paraphrasing fine-grained facts: - Named entities: book/movie/song/game titles (keep quotation marks), brand names, places, pet names, nicknames, colors, specific activities, specific numbers. - Quantities: exact counts, frequencies ("twice", "three times"), durations ("for 3 years", "since 2019"). - Lists: if someone me...

  51. [51]

    Cover names, places, objects, opinions, plans, feelings, events, dates, gifts, hobbies, relationships, pets, travel, food, books, art, music, work, family, health. [Output Format] Return a JSON array: [ { "lossless_restatement": "Complete sentence with all subjects, objects, time, location", "keywords": ["keyword1", "keyword2"], "timestamp": "YYYY-MM-DD o...
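The [Output Format] above specifies a JSON array of entries with lossless_restatement, keywords, and timestamp fields. A minimal loader enforcing it might look like the sketch below; the helper name and the bare-year timestamp fallback are assumptions, since the excerpt truncates the format string.

```python
import json
import re

REQUIRED = ("lossless_restatement", "keywords", "timestamp")
# Accept "YYYY-MM-DD" or a bare "YYYY"; the full format spec is truncated
# in the excerpt, so the year-only fallback is an assumption.
TS_RE = re.compile(r"^\d{4}(-\d{2}-\d{2})?$")

def parse_memory_entries(raw: str):
    """Parse the extractor's JSON array, dropping malformed entries."""
    valid = []
    for e in json.loads(raw):
        if all(k in e for k in REQUIRED) and TS_RE.match(e["timestamp"]) and e["keywords"]:
            valid.append(e)
    return valid
```

Dropping malformed entries at ingest keeps later retrieval stages from ever seeing memories without keywords or a usable timestamp.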

  52. [52]

    ALWAYS provide a substantive answer, never ’not specified’

  53. [53]

Answer in 1-5 words using exact facts from context. Return JSON: {"reasoning":"brief","answer":"concise"}

LoCoMo Cat. 3 Inferential (base): User Prompt
Question: {question} Context: {context}
This question asks for an INFERENCE or COUNTERFACTUAL judgement (e.g., ’Would X...’, ’What would X likely...’). Your job is to synthesize a best-guess answer from th...

  54. [54]

    NEVER answer ’unknown’ / ’not specified’ / ’not mentioned’. The answer must always be a substantive judgement

  55. [55]

Answer in 1-6 words. Preferred forms:
- ’Would X...’ -> ’Likely yes’ / ’Likely no’ / ’Yes’ / ’No’ (+ a short reason ONLY if very informative).
- ’What/Which would X...’ -> name the most likely option

  56. [56]

Choose the option most consistent with the user’s stated preferences, history, and values in the context. Return JSON: {"reasoning":"brief","answer":"concise"}

LoCoMo Cat. 3 Nuanced-Inferential (discovered, 6 subtypes): User Prompt
Gated by locomo_cat3_inferential_nuanced. A regex classifier routes each Cat. 3 question to one of six subtypes; each rece...

  57. [57]

    CRITICAL: NEVER answer ’Unknown’, ’Not specified’, ’Not mentioned’, empty string, or any refusal. Make the MOST PLAUSIBLE guess grounded in the speaker’s stated preferences

  58. [58]

    Prefer exact phrases from context over paraphrased abstract nouns (’Nintendo Switch’, not ’a console’)

  59. [59]

    For counterfactuals (’would X...’), pick the option most consistent with the speaker’s stated preferences

  60. [60]

Geography: if the question asks for a STATE / COUNTRY and context only has a CITY / LANDMARK, infer the enclosing jurisdiction. Return JSON: {"reasoning":"brief","answer":"<answer>"}

Subtype-specific instructions (abbreviated):
• Counting: enumerate matching events, return spelled-out number (<5) or Arabic digit.
• Location-hierarchy: copy named jurisdiction...
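The paper describes a regex classifier that routes each Cat. 3 question to one of six subtype-specific prompts. A sketch of such a router follows; the patterns, and any subtype names beyond the quoted counting/geography/counterfactual ones, are illustrative assumptions.

```python
import re

# Only subtypes named in the excerpt are included; the real system has six.
SUBTYPE_PATTERNS = [
    ("counting", re.compile(r"\bhow (many|often)\b", re.I)),
    ("geography", re.compile(r"\b(which|what) (state|country)\b", re.I)),
    ("counterfactual", re.compile(r"\bwould \w+\b", re.I)),
]

def route_cat3(question: str) -> str:
    """Return the first subtype whose pattern fires, else a default."""
    for name, pattern in SUBTYPE_PATTERNS:
        if pattern.search(question):
            return name
    return "default"
```

First-match-wins ordering matters here: a question like "How many states would she visit?" is routed to counting, so more specific patterns belong earlier in the list.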

  61. [61]

    Answer in 1-10 words. Use EXACT words/phrases from context

  62. [62]

Format conventions:
- ’how many/times’ -> single Arabic numeral (’2’, not ’two’).
- ’when’ / year questions -> 4-digit year (’2019’) or ’YYYY-MM-DD’ if date is known; never ’N years ago’.
- ’where’ -> place name exactly as in context.
- ’what/who’ -> shortest distinctive noun phrase in context
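The format conventions above could also be enforced after generation. The post-processor below is a sketch under stated assumptions: the function name, word-to-numeral table, and question heuristics are illustrative, not part of the paper's pipeline.

```python
import re

WORD_NUMS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize_answer(question: str, answer: str) -> str:
    """Apply the format conventions: Arabic numerals for counts, a 4-digit
    year (or full date) for 'when' questions, otherwise pass through."""
    q = question.lower()
    a = answer.strip()
    if "how many" in q:
        return WORD_NUMS.get(a.lower(), a)  # 'two' -> '2'
    if q.startswith("when") or "what year" in q:
        m = re.search(r"\b(\d{4}-\d{2}-\d{2}|\d{4})\b", a)
        if m:
            return m.group(1)  # strip everything but the date/year
    return a
```

Such a normalizer guards against exact-match scoring penalizing a correct answer that arrives in the wrong surface form.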

  63. [63]

    NEVER answer ’Unknown’, ’Not specified’, ’Not mentioned’. Even if context is indirect, pick the single most plausible answer from what IS mentioned

  64. [64]

For multi-item questions (e.g. ’which movies’), list each item separated by a comma. Return JSON: {"reasoning":"brief","answer":"concise"}

F.4 Answer Generation: MemBench (MCQ)

MemBench is multiple-choice. MemBench: System Prompt You are a memory-grounded multiple-choice question answerer. You MUST pick exactly ONE letter (A/B/C/D). JSON only. MemBench...

  65. [65]

    Pick EXACTLY one letter from {A,B,C,D}

  66. [66]

    Base your answer on the context; if context is incomplete still pick the most plausible option

  67. [67]

Return JSON: {"reasoning":"brief","answer":"X"} where X is a single letter.

F.5 Answer Verification (Second Pass)

Invoked when enable_answer_verification is set (Eq. 13). Verifier: System Prompt Answer verifier. JSON output only. Verifier Strict (default): User Prompt Question: {question} Context: {context} Candidate answer: {candidate} Review the candidate...
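The two-pass flow gated by enable_answer_verification might be wired as follows. This is a minimal sketch: the llm stand-in, the abbreviated prompts, and the corrected_answer field are assumptions, not the paper's actual verifier schema.

```python
import json

def answer_with_verification(llm, question, context, config):
    """First pass generates a candidate; when enable_answer_verification
    is set, a second strict pass reviews it and may replace it.
    `llm` is a stand-in callable (prompt -> JSON string)."""
    candidate = json.loads(
        llm(f"Question: {question}\nContext: {context}\nAnswer as JSON.")
    )
    if not config.get("enable_answer_verification"):
        return candidate["answer"]
    verdict = json.loads(llm(
        f"Question: {question}\nContext: {context}\n"
        f"Candidate answer: {candidate['answer']}\nReview the candidate. Return JSON."
    ))
    # Keep the candidate unless the verifier supplies a correction.
    return verdict.get("corrected_answer", candidate["answer"])
```

Because the flag lives in the retrieval configuration, the diagnosis loop can switch this second pass on only when residual ’Unknown’ or format-mismatch failures justify the extra LLM call.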

  68. [68]

    If many ’abstention’ failures -> raise top_k, widen max_context, consider rrf fusion

  69. [69]

    If many ’wrong answer’ failures with high retrieval -> lower max_context or raise weights for strongest view

  70. [70]

    If temporal category weakness -> enable time_decay_half_life_days

  71. [71]

    If adversarial category weakness -> enable_entity_swap=true

  72. [72]

    If multi-hop weakness -> reflection_rounds >= 1

  73. [73]

    If ONE category lags -> per_category_overrides (preserve gains elsewhere)

  74. [74]

    Prefer enabling something disabled BEFORE tuning a small int

  75. [75]

    If residual ’Unknown’ or format-mismatch -> enable_answer_verification=true
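The heuristics in [68]–[75] amount to a rule table from diagnosed failure patterns to RetrievalConfig changes. A sketch of that mapping follows; the failure-count keys, increments, and threshold are illustrative assumptions (in the paper, the diagnosis LLM applies these rules from the failure logs rather than hard-coded counters).

```python
def suggest_config_changes(failure_counts, config, threshold=5):
    """Map failure patterns to RetrievalConfig adjustments, following the
    diagnosis prompt's heuristics. Keys and threshold are assumptions."""
    s = {}
    if failure_counts.get("abstention", 0) > threshold:
        s["keyword_top_k"] = config.get("keyword_top_k", 5) + 4  # raise top_k
        s["max_context"] = config.get("max_context", 8) + 4      # widen context
        s["fusion_mode"] = "rrf"
    if failure_counts.get("temporal", 0) > threshold:
        s["time_decay_half_life_days"] = 30
    if failure_counts.get("adversarial", 0) > threshold:
        s["enable_entity_swap"] = True
    if failure_counts.get("multi_hop", 0) > threshold:
        s["reflection_rounds"] = max(1, config.get("reflection_rounds", 0))
    if failure_counts.get("unknown_or_format", 0) > threshold:
        s["enable_answer_verification"] = True
    return s
```

Only the triggered fields appear in the output, matching the prompt's instruction to include only fields being changed.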

  76. [76]

LoCoMo prompt-surface flags are highest-ROI when their symptom matches; propose them early.

## Output
Return JSON with `parameter_suggestions` as a flat dict of field -> new value. Fields MUST match RetrievalConfig field names exactly. Only include fields you want to change. {{ "root_causes": {{"extraction_gap": {{...}}, "retrieval_miss": {{...}}, "answer...
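Downstream of this output, the meta-analyzer applies the proposed parameter_suggestions only when safeguards confirm no regression, reverting otherwise. A minimal sketch of that guard, with the function name, evaluate stand-in, and epsilon tolerance as assumptions:

```python
def apply_with_guard(current_config, suggestions, evaluate, epsilon=0.0):
    """Meta-analyzer safeguard: accept the diagnosis module's
    parameter_suggestions only if the evaluation score does not regress;
    otherwise revert. `evaluate` is a stand-in (config -> score)."""
    baseline = evaluate(current_config)
    trial = {**current_config, **suggestions}
    if evaluate(trial) >= baseline - epsilon:
        return trial           # accept the evolved configuration
    return current_config      # revert on regression
```

This revert-on-regression check is what lets the AutoResearch loop explore aggressive proposals while guaranteeing the deployed configuration never gets worse than the current baseline on the evaluation set.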