Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

Jialiang Gu; Jiliang Tang; Junyu Yin; Kai Guo; Keren Zhou; Shenglai Zeng; Xianxuan Long; Xiaoze Liu; Zhikai Chen

arxiv: 2606.04315 · v1 · pith:INQZVMQRnew · submitted 2026-06-03 · 💻 cs.AI



Pith reviewed 2026-06-28 06:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords: agentic memory, LLM agents, memory systems, tool calls, cross-scenario evaluation, AutoMEM, storage management, generality

The pith

An agentic harness where the LLM actively manages its own flat text-file storage via tool calls achieves the best cross-scenario ranking among evaluated memory systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates eight existing memory systems plus an agentic harness for search problems across five distinct scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. It finds that the harness, which gives the agent direct control over storage and retrieval through tool calls to flat text files, achieves the best cross-task ranking. A sympathetic reader would care because this result points to a design principle: agent memory works better when the LLM itself decides what to store and when to retrieve than when it depends on a fixed external pipeline. The authors use this finding to introduce AutoMEM as an instantiation of the principle.

Core claim

The agentic harness self-manages flat text-file storage via tool calls and achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. This insight is instantiated in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems evaluated.

What carries the argument

The agentic harness for search problems, which self-manages flat text-file storage via tool calls to give the agent active control over memory operations.
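To make "self-manages flat text-file storage via tool calls" concrete, here is a minimal sketch of what such an interface could look like. It is an illustration only: the class `FlatFileMemory`, the tool names `write`, `read`, and `search`, and the `dispatch` call format are all assumptions made for this example, not the paper's harness or AutoMEM's actual interface, which the abstract does not specify.

```python
from pathlib import Path


class FlatFileMemory:
    """Hypothetical flat text-file store exposed to an agent as tools."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name):
        # Keep only the final path component so a tool call cannot escape root.
        return self.root / Path(name).name

    def write(self, name, text):
        """Tool: append one note to a file and return a short confirmation."""
        with self._path(name).open("a", encoding="utf-8") as f:
            f.write(text.rstrip("\n") + "\n")
        return f"appended to {Path(name).name}"

    def read(self, name):
        """Tool: return the full contents of one file ('' if it is missing)."""
        p = self._path(name)
        return p.read_text(encoding="utf-8") if p.exists() else ""

    def search(self, keyword):
        """Tool: case-insensitive substring search over every .txt file."""
        hits = []
        for p in sorted(self.root.glob("*.txt")):
            lines = p.read_text(encoding="utf-8").splitlines()
            for number, line in enumerate(lines, start=1):
                if keyword.lower() in line.lower():
                    hits.append((p.name, number, line))
        return hits


def dispatch(memory, call):
    """Route one tool call of the assumed form {'tool': ..., 'args': {...}}."""
    tools = {"write": memory.write, "read": memory.read, "search": memory.search}
    return tools[call["tool"]](**call["args"])
```

In a setup like this, the model would emit calls such as `{"tool": "search", "args": {"keyword": "go_back"}}` whenever it decides memory is needed, so the decision of what to store and when to retrieve sits with the agent rather than with a fixed pipeline.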

Load-bearing premise

The five chosen scenarios represent the heterogeneous trajectories that agents encounter in real deployments.

What would settle it

Running the same set of systems on a sixth scenario outside the original five, such as multi-agent collaborative tasks, and checking whether the agentic harness still produces the highest average rank.

Figures

Figures reproduced from arXiv: 2606.04315 by Jialiang Gu, Jiliang Tang, Junyu Yin, Kai Guo, Keren Zhou, Shenglai Zeng, Xianxuan Long, Xiaoze Liu, Zhikai Chen.

**Figure 1.** Per-method preprocessing vs. inference cost per question, averaged across sub-benchmarks within each task category.

**Figure 2.** Schema loss on an AMABench-Web question (“which step performed the return from the product page, and how many scrolls happened before it?”). HippoRAG keeps entity-relation facts but discards step indices and actions; AMA-Agent’s turn-indexed graph keeps both.

**Figure 3.** Per-question token cost vs. amortisation

Original abstract

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The agentic harness ranks highest across the five scenarios, but the claim that active tool-based control drives generality hinges on those scenarios being a solid sample of deployment cases.


The main point here is that an agentic harness letting the LLM manage flat text files through its own tool calls comes out ahead in average ranking over eight other memory systems when tested on five different scenarios. That result is the concrete new piece.

The paper does a solid job setting up the cross-scenario test. Most memory papers stick to one task type like multi-session chat. This one pulls in single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon tasks, then ranks the systems on each. The harness wins the overall ranking, which they turn into the suggestion that giving the agent direct control over storage beats passive pipelines. They also ship AutoMEM as their version of that idea.

The soft spot is the representativeness of the five scenarios. The stress-test note is right that the design implication only follows if these scenarios capture the real variation in agent trajectories. The abstract gives no diversity metric, coverage check, or sensitivity analysis on dimensions like context length or noise level. If the scenarios turn out correlated, the ranking does not license the broader conclusion about active control. The full paper needs to show the actual tables, error bars, and run counts to confirm the ranking is stable.

This is for people working on memory for deployed LLM agents who care about performance beyond a single task format. It has enough of an empirical result to go to a serious referee, though the methods and results sections will need close checking on the scenario choice and statistical support.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates eight memory systems plus an agentic harness for search problems across five scenarios (single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, long-horizon agentic tasks). It reports that the harness, which lets the agent self-manage flat text-file storage via tool calls, achieves the best cross-task ranking, and concludes that memory performance hinges on active agent control over storage and retrieval rather than on passive stores behind fixed pipelines. It instantiates this insight in AutoMEM.

Significance. If the empirical ranking is robust, the work supplies a multi-scenario diagnostic framework and a strong baseline for LLM-agent memory design; the multi-system evaluation and the agentic harness give later work a reproducible point of comparison. It shifts focus from single-scenario tuning to cross-scenario generality.

major comments (1)

[Abstract] The claim that the harness's top cross-task ranking licenses the design implication (active control superior to passive pipelines) is load-bearing on the assumption that the five scenarios adequately sample deployment trajectories; the manuscript supplies no diversity metric, coverage argument, or sensitivity analysis showing the scenarios are heterogeneous rather than correlated on dimensions such as context length or retrieval noise.
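One simple way to probe whether scenarios are correlated is to compare how two scenarios order the same systems, for example with Spearman's rank correlation. The sketch below is a generic pure-Python implementation under that assumption; it is not an analysis from the paper, and the function names are chosen for this example. A value near 1 for a pair of scenarios would suggest they order the systems almost identically and so add little independent evidence.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks.

    x and y hold one score per system, in the same system order.
    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Applied to every pair of the five scenarios, this would give a small correlation matrix that speaks directly to the heterogeneity question raised above.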

minor comments (2)

Clarify the precise interface and state-management protocol of the agentic harness versus the eight baseline systems in the methods section.
Specify how the cross-task ranking is aggregated (e.g., mean rank, weighted sum) and whether statistical significance or error bars accompany the reported ordering.
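For reference, the mean-rank option mentioned above can be sketched as follows. This is an assumption about one plausible aggregation, not the paper's stated method; the function name, the input layout, and the tie rule (alphabetical order of system names) are choices made for this example.

```python
def mean_ranks(scores):
    """Average each system's per-scenario rank.

    scores: {scenario: {system: score}}, where higher scores are better and
    every scenario reports the same set of systems.
    Returns {system: mean rank}, where rank 1 is best.
    Ties are broken by alphabetical system name (an arbitrary choice here).
    """
    systems = sorted(next(iter(scores.values())))
    totals = {s: 0.0 for s in systems}
    for per_system in scores.values():
        # sorted() is stable, so equal scores keep alphabetical order.
        ordered = sorted(systems, key=lambda s: per_system[s], reverse=True)
        for position, s in enumerate(ordered, start=1):
            totals[s] += position
    return {s: totals[s] / len(scores) for s in systems}
```

A weighted sum of raw scores would instead be sensitive to each benchmark's scale, which is one reason the report asks the authors to state which aggregation they used.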

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting a key assumption in the abstract. We respond to the major comment below.


Referee: [Abstract] The claim that the harness's top cross-task ranking licenses the design implication (active control superior to passive pipelines) is load-bearing on the assumption that the five scenarios adequately sample deployment trajectories; the manuscript supplies no diversity metric, coverage argument, or sensitivity analysis showing the scenarios are heterogeneous rather than correlated on dimensions such as context length or retrieval noise.

Authors: The five scenarios were selected to represent distinct regimes of agent deployment, spanning single-turn retrieval, persistent multi-session interaction, sequential trajectory reasoning, robustness under stress (noise and length), and extended planning horizons, with corresponding differences in context length and interaction structure as detailed in Section 3. We agree that no quantitative diversity metric, coverage argument, or sensitivity analysis is supplied. In revision we will add a short discussion of scenario heterogeneity along the noted dimensions and qualify the abstract claim to indicate that the design implication is drawn from the evaluated set of scenarios rather than asserted as universally sampled.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ranking claim is self-contained


The paper reports an empirical comparison of nine memory systems (eight baselines plus the proposed harness) across five fixed scenarios and bases its design suggestion on the observed cross-scenario ranking. No equations, fitted parameters, or first-principles derivations appear in the provided text. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The representativeness of the five scenarios is an external validity assumption rather than a circular reduction of any claimed derivation to its own inputs. The evaluation therefore stands as an independent empirical result against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no visible free parameters, axioms, or invented entities beyond the high-level description of the harness itself.

invented entities (1)

AutoMEM (no independent evidence)
purpose: Agentic memory harness instantiated from the harness insight
Mentioned as the concrete system achieving best generality; no independent evidence supplied in abstract.


Reference graph

Works this paper leans on

56 extracted references · 4 canonical work pages

[1]

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. 2025. https://arxiv.org/abs/2510.17281 MemoryBench : A benchmark for memory and continual learning in LLM systems . ArXiv preprint, abs/2510.17281

Pith/arXiv arXiv 2025
[2]

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://arxiv.org/abs/2412.15204 LongBench v2 : Towards deeper understanding and reasoning on realistic long-context multitasks . ArXiv preprint, abs/2412.15204

Pith/arXiv arXiv 2024
[3]

Gunjan Chhablani, Deshraj Khanna, and Singh Taranjeet. 2024. https://github.com/mem0ai/mem0 Mem0: The memory layer for AI agents . GitHub

2024
[4]

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. https://arxiv.org/abs/2404.16130 From local to global: A graph RAG approach to query-focused summarization . ArXiv preprint, abs/2404.16130

Pith/arXiv arXiv 2024
[5]

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. 2025. https://arxiv.org/abs/2510.18866 LightMem : Lightweight and efficient memory-augmented generation . ArXiv preprint, abs/2510.18866

Pith/arXiv arXiv 2025
[6]

Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 1433–1445

2018
[7]

Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. 2025. https://arxiv.org/abs/2502.14802 From RAG to memory: Non-parametric continual learning for large language models. ArXiv preprint, abs/2502.14802

Pith/arXiv arXiv 2025
[8]

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, and Alex Pentland. 2026. https://arxiv.org/abs/2602.16313 MemoryArena : Benchmarking agent memory in interdependent multi-session agentic tasks . ArXiv preprint, abs/2602.16313

arXiv 2026
[9]

Stefan Heule, Emily Jia, and Naman Jain. 2025. https://cursor.com/blog/semsearch Improving agent with semantic search . Cursor Blog. Published November 6, 2025

2025
[10]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. https://arxiv.org/abs/2404.06654 RULER : What's the real context size of your long-context language models? ArXiv preprint, abs/2404.06654

Pith/arXiv arXiv 2024
[11]

Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, and Chen Zhao. 2026 a . https://arxiv.org/abs/2602.05975 SAGE : Benchmarking and improving retrieval for deep research agents . ArXiv preprint, abs/2602.05975

arXiv 2026
[12]

Yuanzhe Hu, Yu Wang, and Julian McAuley. 2025. https://arxiv.org/abs/2507.05257 MemoryAgentBench : Evaluating memory in LLM agents via incremental multi-turn interactions . ArXiv preprint, abs/2507.05257

Pith/arXiv arXiv 2025
[13]

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, and 28 others. 2026 b . https://arxiv.org/abs/2512.13564 Memory in the age of AI agents . ArXiv preprint, abs/2512.13564

Pith/arXiv arXiv 2026
[14]

Carlos E. Jiménez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. https://arxiv.org/abs/2310.06770 SWE-bench: Can language models resolve real-world GitHub issues? ArXiv preprint, abs/2310.06770

Pith/arXiv arXiv 2023
[15]

Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. https://arxiv.org/abs/2602.13692 ThunderAgent : A simple, fast and program-aware agentic inference system . ArXiv preprint, abs/2602.13692

arXiv 2026
[16]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.550 Dense passage retrieval for open-domain question answering . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769--6781, Online. Ass...

work page doi:10.18653/v1/2020.emnlp-main.550 2020
[17]

Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. 2025. https://arxiv.org/abs/2502.13270 REALTALK : A 21-day real-world dataset for long-term conversation . ArXiv preprint, abs/2502.13270

arXiv 2025
[18]

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html Retrieval-augmented generation for knowledge-inte...

2020
[19]

Kai Li, Xuanqing Yu, Ziyi Ni, Yi Zeng, Yao Xu, Zheqing Zhang, Xin Li, Jitao Sang, Xiaogang Duan, Xuelei Wang, Chengbao Liu, and Jie Tan. 2026 a . https://arxiv.org/abs/2601.02845 TiMem : Temporal-hierarchical memory consolidation for long-horizon conversational agents . ArXiv preprint, abs/2601.02845

Pith/arXiv arXiv 2026
[20]

Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, and Jun Liu. 2026 b . https://arxiv.org/abs/2602.10715 Locomo-plus: Beyond-factual cognitive memory evaluation framework for LLM agents . ArXiv preprint, abs/2602.10715

arXiv 2026
[21]

Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, and Yu Zhang. 2026 c . https://arxiv.org/abs/2605.05242 Beyond semantic similarity: Rethinking retrieval for agentic search via direct corpus i...

Pith/arXiv arXiv 2026
[22]

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. 2026 a . https://arxiv.org/abs/2601.02553 SimpleMem : Efficient lifelong memory for LLM agents . ArXiv preprint, abs/2601.02553

Pith/arXiv arXiv 2026
[23]

Xiaoyuan Liu, Tian Liang, Dongyang Ma, Deyu Zhou, Haitao Mi, Pinjia He, and Yan Wang. 2026 b . https://arxiv.org/abs/2602.12108 The pensieve paradigm: Stateful language models mastering their own context . ArXiv preprint, abs/2602.12108

arXiv 2026
[24]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. https://arxiv.org/abs/2402.17753 Evaluating very long-term conversational memory of LLM agents . ArXiv preprint, abs/2402.17753

Pith/arXiv arXiv 2024
[25]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with hum...

2022
[26]

Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023. https://arxiv.org/abs/2310.08560 MemGPT: Towards LLMs as operating systems. ArXiv preprint, abs/2310.08560

Pith/arXiv arXiv 2023
[27]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel R. Bowman. 2021. https://arxiv.org/abs/2112.08608 QuALITY : Question answering with long input texts, yes! ArXiv preprint, abs/2112.08608

arXiv 2021
[28]

Natchanon Pollertlam and Witchayut Kornsuwannawit. 2026. https://arxiv.org/abs/2603.04814 Beyond the context window: A cost-performance analysis of fact-based memory vs. long-context LLMs for persistent agents. ArXiv preprint, abs/2603.04814

arXiv 2026
[29]

Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. 2024. https://arxiv.org/abs/2409.05591 MemoRAG : Boosting long context processing with global memory-enhanced retrieval augmentation . ArXiv preprint, abs/2409.05591

arXiv 2024
[30]

Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. https://doi.org/10.18653/v1/2024.acl-long.399 LaMP : When large language models meet personalization . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7370--7392. Association for Computational Linguistics

work page doi:10.18653/v1/2024.acl-long.399 2024
[31]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. https://arxiv.org/abs/2302.04761 Toolformer: Language models can teach themselves to use tools. ArXiv preprint, abs/2302.04761

Pith/arXiv arXiv 2023
[32]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 DeepSeekMath : Pushing the limits of mathematical reasoning in open language models . ArXiv preprint, abs/2402.03300

Pith/arXiv arXiv 2024
[33]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. https://arxiv.org/abs/2010.03768 ALFWorld: Aligning text and embodied environments for interactive learning. ArXiv preprint, abs/2010.03768

Pith/arXiv arXiv 2020
[34]

Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. 2026. https://arxiv.org/abs/2602.13530 REMem: Reasoning with episodic memory in language agent. ArXiv preprint, abs/2602.13530

arXiv 2026
[35]

Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, and Mingxuan Yuan. 2026. https://arxiv.org/abs/2601.08160 SwiftMem : Fast agentic memory via query-aware indexing . ArXiv preprint, abs/2601.08160

arXiv 2026
[36]

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024. https://doi.org/10.1007/s11704-024-40231-1 A survey on large language model based autonomous agents . Frontiers of Computer Science, 18(6):186345

work page doi:10.1007/s11704-024-40231-1 2024
[37]

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. 2026. https://arxiv.org/abs/2603.10165 OpenClaw-RL : Train any agent simply by talking . ArXiv preprint, abs/2603.10165

Pith/arXiv arXiv 2026
[38]

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. 2025. https://arxiv.org/abs/2509.25911 Mem-α: Learning memory construction via reinforcement learning. ArXiv preprint, abs/2509.25911

Pith/arXiv arXiv 2025
[39]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2024. https://arxiv.org/abs/2410.10813 LongMemEval : Benchmarking chat assistants on long-term interactive memory . ArXiv preprint, abs/2410.10813

Pith/arXiv arXiv 2024
[40]

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. 2026. https://arxiv.org/abs/2602.08234 SkillRL : Evolving agents via recursive skill-augmented reinforcement learning . ArXiv preprint, abs/2602.08234

Pith/arXiv arXiv 2026
[41]

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. https://arxiv.org/abs/2402.01622 TravelPlanner : A benchmark for real-world planning with language agents . ArXiv preprint, abs/2402.01622

arXiv 2024
[42]

Yiming Xiong, Shengran Hu, and Jeff Clune. 2026. https://arxiv.org/abs/2602.07755 Learning to continually learn via meta-learning agentic memory designs . ArXiv preprint, abs/2602.07755

arXiv 2026
[43]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. https://arxiv.org/abs/2502.12110 A-MEM : Agentic memory for LLM agents . ArXiv preprint, abs/2502.12110

Pith/arXiv arXiv 2025
[44]

Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, and Huawei Shen. 2026. https://arxiv.org/abs/2601.14287 Chain-of-memory: Lightweight memory construction with dynamic evolution for LLM agents . ArXiv preprint, abs/2601.14287

Pith/arXiv arXiv 2026
[45]

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. 2025. https://arxiv.org/abs/2508.19828 Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. ArXiv preprint, abs/2508.19828

Pith/arXiv arXiv 2025
[46]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . ArXiv preprint, abs/2505.09388

Pith/arXiv arXiv 2025
[47]

Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, and ChengXiang Zhai. 2026. https://arxiv.org/abs/2603.03296 PlugMem : A task-agnostic plugin memory module for LLM agents . ArXiv preprint, abs/2603.03296

arXiv 2026
[48]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels...

work page doi:10.18653/v1/d18-1259 2018
[49]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. https://arxiv.org/abs/2210.03629 ReAct : Synergizing reasoning and acting in language models . ArXiv preprint, abs/2210.03629

Pith/arXiv arXiv 2022
[50]

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. 2025. https://arxiv.org/abs/2507.02259 MemAgent : Reshaping long-context LLM with multi-conv RL -based memory agent . ArXiv preprint, abs/2507.02259

Pith/arXiv arXiv 2025
[51]

Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. 2026. https://arxiv.org/abs/2601.23014 Mem-T : Densifying rewards for long-horizon memory agents . ArXiv preprint, abs/2601.23014

arXiv 2026
[52]

Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, and Wenya Wang. 2026 a . https://arxiv.org/abs/2602.06025 Learning query-aware budget-tier routing for runtime agent memory . ArXiv preprint, abs/2602.06025

Pith/arXiv arXiv 2026
[53]

Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, and Muning Wen. 2026 b . https://arxiv.org/abs/2601.03192 MemRL : Self-evolving agents via runtime reinforcement learning on episodic memory . ArXiv preprint, abs/2601.03192

Pith/arXiv arXiv 2026
[54]

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, and Jishen Zhao. 2026. https://arxiv.org/abs/2602.22769 AMA-Bench : Evaluating long-horizon memory for agentic applications . ArXiv preprint, abs/2602.22769

Pith/arXiv arXiv 2026
[55]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2023. https://arxiv.org/abs/2305.10250 MemoryBank : Enhancing large language models with long-term memory . ArXiv preprint, abs/2305.10250

Pith/arXiv arXiv 2023
[56]

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. 2025. https://arxiv.org/abs/2506.15841 MEM1 : Learning to synergize memory and reasoning for efficient long-horizon agents . ArXiv preprint, abs/2506.15841

Pith/arXiv arXiv 2025

[1] [1]

Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. 2025. https://arxiv.org/abs/2510.17281 MemoryBench : A benchmark for memory and continual learning in LLM systems . ArXiv preprint, abs/2510.17281

Pith/arXiv arXiv 2025

[2] [2]

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://arxiv.org/abs/2412.15204 LongBench v2 : Towards deeper understanding and reasoning on realistic long-context multitasks . ArXiv preprint, abs/2412.15204

Pith/arXiv arXiv 2024

[3] [3]

Gunjan Chhablani, Deshraj Khanna, and Singh Taranjeet. 2024. https://github.com/mem0ai/mem0 Mem0: The memory layer for AI agents . GitHub

2024

[4] [4]

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. https://arxiv.org/abs/2404.16130 From local to global: A graph RAG approach to query-focused summarization . ArXiv preprint, abs/2404.16130

Pith/arXiv arXiv 2024

[5] [5]

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. 2025. https://arxiv.org/abs/2510.18866 LightMem : Lightweight and efficient memory-augmented generation . ArXiv preprint, abs/2510.18866

Pith/arXiv arXiv 2025

[6] [6]

Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andr \'e s Taylor. 2018. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD), pages 1433--1445

2018

[7] [7]

Bernal Jim \'e nez Guti \'e rrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. 2025. https://arxiv.org/abs/2502.14802 From RAG to memory: Non-parametric continual learning for large language models . ArXiv preprint, abs/2502.14802

Pith/arXiv arXiv 2025

[8] [8]

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, and Alex Pentland. 2026. https://arxiv.org/abs/2602.16313 MemoryArena : Benchmarking agent memory in interdependent multi-session agentic tasks . ArXiv preprint, abs/2602.16313

arXiv 2026

[9] [9]

Stefan Heule, Emily Jia, and Naman Jain. 2025. https://cursor.com/blog/semsearch Improving agent with semantic search . Cursor Blog. Published November 6, 2025

2025

[10] [10]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. https://arxiv.org/abs/2404.06654 RULER : What's the real context size of your long-context language models? ArXiv preprint, abs/2404.06654

Pith/arXiv arXiv 2024

[11] [11]

Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, and Chen Zhao. 2026 a . https://arxiv.org/abs/2602.05975 SAGE : Benchmarking and improving retrieval for deep research agents . ArXiv preprint, abs/2602.05975

arXiv 2026

[12] [12]

Yuanzhe Hu, Yu Wang, and Julian McAuley. 2025. https://arxiv.org/abs/2507.05257 MemoryAgentBench : Evaluating memory in LLM agents via incremental multi-turn interactions . ArXiv preprint, abs/2507.05257

Pith/arXiv arXiv 2025

[13] [13]

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, and 28 others. 2026 b . https://arxiv.org/abs/2512.13564 Memory in the age of AI agents . ArXiv preprint, abs/2512.13564

[14]

Carlos E. Jiménez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. https://arxiv.org/abs/2310.06770 SWE-bench: Can language models resolve real-world GitHub issues? ArXiv preprint, abs/2310.06770

[15]

Hao Kang, Ziyang Li, Xinyu Yang, Weili Xu, Yinfang Chen, Junxiong Wang, Beidi Chen, Tushar Krishna, Chenfeng Xu, and Simran Arora. 2026. https://arxiv.org/abs/2602.13692 ThunderAgent: A simple, fast and program-aware agentic inference system. ArXiv preprint, abs/2602.13692

[16]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.550 Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Ass...

[17]

Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. 2025. https://arxiv.org/abs/2502.13270 REALTALK: A 21-day real-world dataset for long-term conversation. ArXiv preprint, abs/2502.13270

[18]

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html Retrieval-augmented generation for knowledge-inte...

[19]

Kai Li, Xuanqing Yu, Ziyi Ni, Yi Zeng, Yao Xu, Zheqing Zhang, Xin Li, Jitao Sang, Xiaogang Duan, Xuelei Wang, Chengbao Liu, and Jie Tan. 2026a. https://arxiv.org/abs/2601.02845 TiMem: Temporal-hierarchical memory consolidation for long-horizon conversational agents. ArXiv preprint, abs/2601.02845

[20]

Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, and Jun Liu. 2026b. https://arxiv.org/abs/2602.10715 Locomo-plus: Beyond-factual cognitive memory evaluation framework for LLM agents. ArXiv preprint, abs/2602.10715

[21]

Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, and Yu Zhang. 2026c. https://arxiv.org/abs/2605.05242 Beyond semantic similarity: Rethinking retrieval for agentic search via direct corpus i...

[22]

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. 2026a. https://arxiv.org/abs/2601.02553 SimpleMem: Efficient lifelong memory for LLM agents. ArXiv preprint, abs/2601.02553

[23]

Xiaoyuan Liu, Tian Liang, Dongyang Ma, Deyu Zhou, Haitao Mi, Pinjia He, and Yan Wang. 2026b. https://arxiv.org/abs/2602.12108 The pensieve paradigm: Stateful language models mastering their own context. ArXiv preprint, abs/2602.12108

[24]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. https://arxiv.org/abs/2402.17753 Evaluating very long-term conversational memory of LLM agents. ArXiv preprint, abs/2402.17753

[25]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with hum...

[26]

Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. 2023. https://arxiv.org/abs/2310.08560 MemGPT: Towards LLMs as operating systems. ArXiv preprint, abs/2310.08560

[27]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, and Samuel R. Bowman. 2021. https://arxiv.org/abs/2112.08608 QuALITY: Question answering with long input texts, yes! ArXiv preprint, abs/2112.08608

[28]

Natchanon Pollertlam and Witchayut Kornsuwannawit. 2026. https://arxiv.org/abs/2603.04814 Beyond the context window: A cost-performance analysis of fact-based memory vs. long-context LLMs for persistent agents. ArXiv preprint, abs/2603.04814

[29]

Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. 2024. https://arxiv.org/abs/2409.05591 MemoRAG: Boosting long context processing with global memory-enhanced retrieval augmentation. ArXiv preprint, abs/2409.05591

[30]

Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. https://doi.org/10.18653/v1/2024.acl-long.399 LaMP: When large language models meet personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7370–7392. Association for Computational Linguistics

[31]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. https://arxiv.org/abs/2302.04761 Toolformer: Language models can teach themselves to use tools. ArXiv preprint, abs/2302.04761

[32]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. ArXiv preprint, abs/2402.03300

[33]

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. 2020. https://arxiv.org/abs/2010.03768 ALFWorld: Aligning text and embodied environments for interactive learning. ArXiv preprint, abs/2010.03768

[34]

Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. 2026. https://arxiv.org/abs/2602.13530 REMem: Reasoning with episodic memory in language agent. ArXiv preprint, abs/2602.13530

[35]

Anxin Tian, Yiming Li, Xing Li, Hui-Ling Zhen, Lei Chen, Xianzhi Yu, Zhenhua Dong, and Mingxuan Yuan. 2026. https://arxiv.org/abs/2601.08160 SwiftMem: Fast agentic memory via query-aware indexing. ArXiv preprint, abs/2601.08160

[36]

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024. https://doi.org/10.1007/s11704-024-40231-1 A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345

[37]

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. 2026. https://arxiv.org/abs/2603.10165 OpenClaw-RL: Train any agent simply by talking. ArXiv preprint, abs/2603.10165

[38]

Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. 2025. https://arxiv.org/abs/2509.25911 Mem-α: Learning memory construction via reinforcement learning. ArXiv preprint, abs/2509.25911

[39]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2024. https://arxiv.org/abs/2410.10813 LongMemEval: Benchmarking chat assistants on long-term interactive memory. ArXiv preprint, abs/2410.10813

[40]

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. 2026. https://arxiv.org/abs/2602.08234 SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. ArXiv preprint, abs/2602.08234

[41]

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. 2024. https://arxiv.org/abs/2402.01622 TravelPlanner: A benchmark for real-world planning with language agents. ArXiv preprint, abs/2402.01622

[42]

Yiming Xiong, Shengran Hu, and Jeff Clune. 2026. https://arxiv.org/abs/2602.07755 Learning to continually learn via meta-learning agentic memory designs. ArXiv preprint, abs/2602.07755

[43]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. https://arxiv.org/abs/2502.12110 A-MEM: Agentic memory for LLM agents. ArXiv preprint, abs/2502.12110

[44]

Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, and Huawei Shen. 2026. https://arxiv.org/abs/2601.14287 Chain-of-memory: Lightweight memory construction with dynamic evolution for LLM agents. ArXiv preprint, abs/2601.14287

[45]

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. 2025. https://arxiv.org/abs/2508.19828 Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. ArXiv preprint, abs/2508.19828

[46]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report. ArXiv preprint, abs/2505.09388

[47]

Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, and ChengXiang Zhai. 2026. https://arxiv.org/abs/2603.03296 PlugMem: A task-agnostic plugin memory module for LLM agents. ArXiv preprint, abs/2603.03296

[48]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels...

[49]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. https://arxiv.org/abs/2210.03629 ReAct: Synergizing reasoning and acting in language models. ArXiv preprint, abs/2210.03629

[50]

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. 2025. https://arxiv.org/abs/2507.02259 MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. ArXiv preprint, abs/2507.02259

[51]

Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. 2026. https://arxiv.org/abs/2601.23014 Mem-T: Densifying rewards for long-horizon memory agents. ArXiv preprint, abs/2601.23014

[52]

Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, and Wenya Wang. 2026a. https://arxiv.org/abs/2602.06025 Learning query-aware budget-tier routing for runtime agent memory. ArXiv preprint, abs/2602.06025

[53]

Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Yutao Qi, Bo Tang, and Muning Wen. 2026b. https://arxiv.org/abs/2601.03192 MemRL: Self-evolving agents via runtime reinforcement learning on episodic memory. ArXiv preprint, abs/2601.03192

[54]

Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, and Jishen Zhao. 2026. https://arxiv.org/abs/2602.22769 AMA-Bench: Evaluating long-horizon memory for agentic applications. ArXiv preprint, abs/2602.22769

[55]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2023. https://arxiv.org/abs/2305.10250 MemoryBank: Enhancing large language models with long-term memory. ArXiv preprint, abs/2305.10250

[56]

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. 2025. https://arxiv.org/abs/2506.15841 MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. ArXiv preprint, abs/2506.15841
