
arxiv: 2606.04315 · v1 · pith:INQZVMQRnew · submitted 2026-06-03 · 💻 cs.AI

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

Pith reviewed 2026-06-28 06:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic memory · LLM agents · memory systems · tool calls · cross-scenario evaluation · AutoMEM · storage management · generality

The pith

An agentic harness where the LLM actively manages its own flat text-file storage via tool calls achieves the best cross-scenario ranking among evaluated memory systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates eight existing memory systems plus one new agentic harness across five distinct scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. It finds that the harness, which gives the agent direct control over storage and retrieval operations through tool calls to flat text files, consistently ranks highest in average performance. A sympathetic reader would care because this result points to a design principle: agent memory works better when the LLM itself decides what to store and when to retrieve rather than depending on a fixed external pipeline. The authors use this finding to introduce AutoMEM as an instantiation of the principle.

Core claim

The agentic harness self-manages flat text-file storage via tool calls and achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. This insight is instantiated in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems evaluated.

What carries the argument

The agentic harness for search problems, which self-manages flat text-file storage via tool calls to give the agent active control over memory operations.
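
To make the idea of self-managed flat text-file storage concrete, here is a minimal sketch of what such a tool surface could look like. It is not the paper's implementation: the class name `FlatFileMemory`, the three tools (`write`, `read`, `search`), and the dictionary-shaped tool call are all assumptions made for illustration, and the stored lines are toy data.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


class FlatFileMemory:
    """Hypothetical tool surface: the agent decides what to store and when to search."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        # Keep only the final path component so a tool call cannot leave the root.
        return self.root / Path(name).name

    def write(self, name: str, text: str) -> str:
        # Append one line of text to a flat file, creating the file if needed.
        with self._path(name).open("a", encoding="utf-8") as fh:
            fh.write(text.rstrip("\n") + "\n")
        return "appended to " + Path(name).name

    def read(self, name: str) -> str:
        path = self._path(name)
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def search(self, query: str) -> List[Tuple[str, int, str]]:
        # Case-insensitive substring search over every .txt file in the store.
        needle = query.lower()
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            lines = path.read_text(encoding="utf-8").splitlines()
            for lineno, line in enumerate(lines, start=1):
                if needle in line.lower():
                    hits.append((path.name, lineno, line))
        return hits

    def dispatch(self, call: dict) -> object:
        # Assumed tool-call shape: {"tool": <name>, "args": {...}}.
        tools = {"write": self.write, "read": self.read, "search": self.search}
        return tools[call["tool"]](**call["args"])


def demo() -> List[Tuple[str, int, str]]:
    """Replay a few toy tool calls an agent might issue, then search the store."""
    with tempfile.TemporaryDirectory() as tmp:
        memory = FlatFileMemory(Path(tmp) / "memory")
        for step in ("step 5: scroll", "step 6: scroll", "step 7: go_back"):
            memory.dispatch({"tool": "write",
                             "args": {"name": "trajectory.txt", "text": step}})
        memory.dispatch({"tool": "write",
                         "args": {"name": "notes.txt", "text": "black case costs $11.99"}})
        return memory.dispatch({"tool": "search", "args": {"query": "go_back"}})


if __name__ == "__main__":
    print(demo())
```

In this sketch the storage layer does nothing on its own: every write and every lookup happens only because a tool call asked for it, which is the contrast with a fixed indexing pipeline that the core claim draws.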

Load-bearing premise

The five chosen scenarios represent the heterogeneous trajectories that agents encounter in real deployments.

What would settle it

Running the same set of systems on a sixth scenario outside the original five, such as multi-agent collaborative tasks, and checking whether the agentic harness still produces the highest average rank.

Figures

Figures reproduced from arXiv: 2606.04315 by Jialiang Gu, Jiliang Tang, Junyu Yin, Kai Guo, Keren Zhou, Shenglai Zeng, Xianxuan Long, Xiaoze Liu, Zhikai Chen.

Figure 1: Per-method preprocessing vs. inference cost per question, averaged across sub-benchmarks within each task category.
Figure 2: Schema loss on an AMABench-Web question (“which step performed the return from the product page, and how many scrolls happened before it?”). HippoRAG keeps entity-relation facts but discards step indices and actions; AMA-Agent’s turn-indexed graph keeps both.
Figure 3: Per-question token cost vs. amortisation
Original abstract

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates eight memory systems plus a new agentic harness (AutoMEM) across five scenarios (single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, long-horizon agentic tasks). It reports that the harness, which lets the agent self-manage flat text-file storage via tool calls, achieves the best cross-task ranking and concludes that memory performance hinges on active agent control over storage/retrieval rather than passive stores behind fixed pipelines.

Significance. If the empirical ranking is robust, the work supplies a multi-scenario diagnostic framework and a strong baseline for LLM-agent memory design, explicitly crediting the multi-system evaluation and the agentic harness as a reproducible point of comparison. It shifts focus from single-scenario tuning to cross-scenario generality.

major comments (1)
  1. [Abstract] The claim that the harness's top cross-task ranking licenses the design implication (active control is superior to passive pipelines) rests on the assumption that the five scenarios adequately sample deployment trajectories. The manuscript supplies no diversity metric, coverage argument, or sensitivity analysis showing that the scenarios are heterogeneous rather than correlated on dimensions such as context length or retrieval noise.
minor comments (2)
  1. Clarify the precise interface and state-management protocol of the agentic harness versus the eight baseline systems in the methods section.
  2. Specify how the cross-task ranking is aggregated (e.g., mean rank, weighted sum) and whether statistical significance or error bars accompany the reported ordering.
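
For reference, one common way to aggregate a cross-task ranking is the mean rank across scenarios. The sketch below illustrates that option only; the paper's actual aggregation is not stated in the abstract, and the system names, scenario names, and scores are toy values invented for the example.

```python
from statistics import mean


def ranks_desc(scores: dict) -> dict:
    """Rank systems within one scenario (1 = best); tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of systems tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def mean_rank(table: dict) -> dict:
    """table maps scenario -> {system: score}; returns system -> mean rank."""
    per_scenario = [ranks_desc(scores) for scores in table.values()]
    systems = list(per_scenario[0])
    return {name: mean(r[name] for r in per_scenario) for name in systems}


# Toy scores (higher is better); not taken from the paper.
TOY = {
    "scenario_1": {"system_a": 0.70, "system_b": 0.60, "system_c": 0.60},
    "scenario_2": {"system_a": 0.40, "system_b": 0.55, "system_c": 0.30},
    "scenario_3": {"system_a": 0.80, "system_b": 0.65, "system_c": 0.75},
}

if __name__ == "__main__":
    result = mean_rank(TOY)
    for name in sorted(result, key=result.get):
        print(name, round(result[name], 3))
```

Mean rank discards score magnitudes, so a system that wins narrowly everywhere outranks one that wins by a wide margin in most scenarios and loses one. That is part of why the aggregation rule needs to be stated explicitly.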

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting a key assumption in the abstract. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that the harness's top cross-task ranking licenses the design implication (active control is superior to passive pipelines) rests on the assumption that the five scenarios adequately sample deployment trajectories. The manuscript supplies no diversity metric, coverage argument, or sensitivity analysis showing that the scenarios are heterogeneous rather than correlated on dimensions such as context length or retrieval noise.

    Authors: The five scenarios were selected to represent distinct regimes of agent deployment, spanning single-turn retrieval, persistent multi-session interaction, sequential trajectory reasoning, robustness under stress (noise and length), and extended planning horizons, with corresponding differences in context length and interaction structure as detailed in Section 3. We agree that no quantitative diversity metric, coverage argument, or sensitivity analysis is supplied. In revision we will add a short discussion of scenario heterogeneity along the noted dimensions and qualify the abstract claim to indicate that the design implication is drawn from the evaluated set of scenarios rather than asserted as universally sampled. revision: yes
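
One simple form the sensitivity analysis discussed here could take is a leave-one-scenario-out check: drop each scenario in turn and see whether the top-ranked system changes. The sketch below is an illustration under stated assumptions, not the authors' planned analysis. The function names and all scores are hypothetical, it assumes no tied scores within a scenario, and it compares rank sums, which order systems the same way as mean rank when every system is scored in every scenario.

```python
def rank_positions(scores: dict) -> dict:
    """Positions within one scenario (1 = best). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def top_systems(table: dict) -> list:
    """Systems with the lowest rank sum; more than one name means a tie at the top."""
    totals = {}
    for scores in table.values():
        for name, pos in rank_positions(scores).items():
            totals[name] = totals.get(name, 0) + pos
    best = min(totals.values())
    return sorted(name for name, total in totals.items() if total == best)


def leave_one_out(table: dict) -> dict:
    """Map each dropped scenario to the top system(s) over the remaining scenarios."""
    return {
        dropped: top_systems({k: v for k, v in table.items() if k != dropped})
        for dropped in table
    }


# Toy scores (higher is better); not taken from the paper.
TOY = {
    "scenario_1": {"system_a": 0.9, "system_b": 0.7, "system_c": 0.5},
    "scenario_2": {"system_a": 0.8, "system_b": 0.6, "system_c": 0.4},
    "scenario_3": {"system_a": 0.2, "system_b": 0.9, "system_c": 0.5},
    "scenario_4": {"system_a": 0.7, "system_b": 0.3, "system_c": 0.6},
}

if __name__ == "__main__":
    print("all scenarios:", top_systems(TOY))
    for dropped, winners in leave_one_out(TOY).items():
        print("without", dropped, "->", winners)
```

In the toy data the overall winner is unique, yet removing `scenario_4` leaves two systems tied at the top. An ordering that shifts when a single scenario is removed is the kind of fragility the referee's comment asks the authors to rule out.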

Circularity Check

0 steps flagged

No circularity: empirical ranking claim is self-contained

Full rationale

The paper reports an empirical comparison of nine memory systems (eight baselines plus the proposed harness) across five fixed scenarios and bases its design suggestion on the observed cross-scenario ranking. No equations, fitted parameters, or first-principles derivations appear in the provided text. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The representativeness of the five scenarios is an external validity assumption rather than a circular reduction of any claimed derivation to its own inputs. The evaluation therefore stands as an independent empirical result against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no visible free parameters, axioms, or invented entities beyond the high-level description of the harness itself.

invented entities (1)
  • AutoMEM (no independent evidence)
    purpose: Agentic memory harness instantiated from the harness insight
    Mentioned as the concrete system achieving best generality; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5716 in / 1087 out tokens · 29993 ms · 2026-06-28T06:46:45.751751+00:00 · methodology

