Multi-Head Recurrent Memory Agents

Jiatong Li; Samuel Yeh; Sharon Li

arXiv: 2607.01523 · v1 · pith:LW3A5E4Gnew · submitted 2026-07-01 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-07-03 20:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: recurrent memory · long context · LLM agents · memory retention · multi-head architecture · training-free optimization

The pith

Splitting recurrent memory into independent heads and updating only one per step raises retention from under 30 percent to about 74 percent on RULER-HQA at 896K tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recurrent memory agents for long-context LLMs lose performance mainly because a single memory block gets overwritten during every update. The paper decomposes the problem into capture versus retention and shows retention is the main failure point. By dividing memory into separate heads and using a select-then-update rule that shields all but one head, the design moves retention protection from model behavior to architecture. A simple least-recently-updated version of this scheme, called MHM-LRU, needs no extra tokens or training yet improves both retention and end-to-end accuracy on benchmarks from 100K to 1M tokens.
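
The capture-versus-retention split can be made concrete with a small sketch. The definitions below are assumptions for illustration, not the paper's formulas: a fact counts as captured if it appears in memory right after the step that processed its source chunk, and as retained if it was captured and is still present in the final memory. The function name `capture_and_retention`, the substring matching, and the toy facts are all hypothetical.

```python
from typing import Dict, List, Tuple


def capture_and_retention(
    snapshots: List[str],       # memory text recorded after each step
    fact_step: Dict[str, int],  # fact -> index of the step whose chunk contains it
) -> Tuple[float, float]:
    """Return (capture rate, retention rate) under the assumed definitions."""
    # Captured: the fact is in memory right after its source chunk was processed.
    captured = [f for f, t in fact_step.items() if f in snapshots[t]]
    # Retained: a captured fact that still appears in the final memory.
    retained = [f for f in captured if f in snapshots[-1]]
    capture_rate = len(captured) / len(fact_step) if fact_step else 0.0
    retention_rate = len(retained) / len(captured) if captured else 0.0
    return capture_rate, retention_rate


# Toy trajectory (made-up data): "alice:paris" is captured, then overwritten.
snapshots = [
    "alice:paris",
    "alice:paris bob:rome",
    "bob:rome carol:oslo",
]
fact_step = {"alice:paris": 0, "bob:rome": 1, "carol:oslo": 2, "dave:lima": 1}
capture_rate, retention_rate = capture_and_retention(snapshots, fact_step)
```

In this toy run the missed fact lowers only the capture rate, while the overwritten fact lowers only the retention rate, which is the separation the diagnosis relies on.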

Core claim

Existing recurrent memory agents treat memory as one monolithic text block, so every consolidation step risks erasing earlier content and retention collapses with growing context length. Multi-Head Recurrent Memory partitions the fixed-size memory into independent heads and applies a stage-wise select-then-update rule: exactly one head is chosen for the current update while the others stay structurally untouched. The MHM-LRU instantiation enforces uniform head rotation with zero added cost, lifting measured retention on RULER-HQA at 896K tokens from below 30 percent to 73.96 percent and delivering corresponding gains in task accuracy across model families.

What carries the argument

Multi-Head Recurrent Memory (MHM) with stage-wise select-then-update strategy that structurally shields all heads except the single chosen one during each consolidation step.
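
A minimal sketch of the select-then-update loop with least-recently-updated selection, assuming memory heads are plain strings and the LLM update call is replaced by a stub. `MultiHeadMemory`, `consolidate`, and `truncate_update` are hypothetical names; the paper's prompts and head sizes are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signature: (old head text, new chunk) -> new head text.
# In a real agent this would be an LLM call; here it is any callable.
UpdateFn = Callable[[str, str], str]


@dataclass
class MultiHeadMemory:
    num_heads: int
    heads: List[str] = field(default_factory=list)
    last_updated: List[int] = field(default_factory=list)
    step: int = 0

    def __post_init__(self) -> None:
        self.heads = [""] * self.num_heads
        # -1 marks "never updated", so empty heads are filled first.
        self.last_updated = [-1] * self.num_heads

    def select_lru(self) -> int:
        # Select stage: oldest update time wins; ties go to the lowest index.
        return min(range(self.num_heads), key=lambda i: self.last_updated[i])

    def consolidate(self, chunk: str, update_fn: UpdateFn) -> int:
        idx = self.select_lru()
        # Update stage: only the chosen head is rewritten; the other heads
        # are never handed to update_fn, so they cannot be overwritten.
        self.heads[idx] = update_fn(self.heads[idx], chunk)
        self.last_updated[idx] = self.step
        self.step += 1
        return idx


def truncate_update(old: str, chunk: str, budget: int = 40) -> str:
    """Stand-in for the LLM: append the chunk, keep the last `budget` characters."""
    return (old + " " + chunk).strip()[-budget:]


mem = MultiHeadMemory(num_heads=3)
order = [mem.consolidate(f"chunk{i}", truncate_update) for i in range(7)]
```

With three heads the selection order is a plain round-robin, which is the uniform rotation the paper attributes to MHM-LRU. Whether the update prompt also shows the shielded heads to the model as read-only context is not stated on this page, so the stub sees only the selected head.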

If this is right

End-to-end accuracy on long-context tasks rises roughly linearly with the measured retention rate, as in the Figure 2(b) correlation analysis.
The improvement holds across different base models and task types without any retraining.
Retention becomes an architectural guarantee rather than an emergent model behavior.
Uniform head rotation adds no token overhead yet prevents the degradation seen in single-block memory.
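
The intuition behind these consequences can be illustrated with a toy model that is not taken from the paper. Assume each rewrite of a memory block independently drops a given stored fact with probability `drop_prob`. A monolithic memory is rewritten at every step, while under uniform rotation over `num_heads` heads the head holding the fact is rewritten only once every `num_heads` steps. The function name and the numbers are hypothetical.

```python
def survival_probability(steps: int, num_heads: int, drop_prob: float) -> float:
    """Chance that a fact stored at step 0 survives `steps` further updates.

    Toy assumption: every rewrite of the block holding the fact drops it
    independently with probability `drop_prob`.
    """
    if num_heads < 1 or steps < 0 or not 0.0 <= drop_prob <= 1.0:
        raise ValueError("invalid arguments")
    # Under uniform rotation the fact's head is rewritten once per full cycle.
    rewrites = steps // num_heads
    return (1.0 - drop_prob) ** rewrites


# num_heads=1 corresponds to the monolithic single-block baseline.
curve = {
    h: survival_probability(steps=64, num_heads=h, drop_prob=0.05)
    for h in (1, 2, 4, 8, 16)
}
```

The model ignores a real trade-off: with a fixed total memory budget each head is smaller, so the per-rewrite drop probability may rise as the head count grows. It only shows why fewer rewrites per head can help retention, not how large the effect is in practice.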

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same head-partition idea could be applied to other fixed-size memory structures such as key-value caches.
Optimal head count or selection policy might vary by task and could be tuned without changing the core shielding rule.
Uniform rotation may limit specialization of individual heads, which could be checked by tracking per-head contribution on mixed tasks.

Load-bearing premise

That keeping most heads untouched during each update will still let the system capture new information without coordination failures or missed updates across heads.

What would settle it

Measure end-to-end accuracy on a task engineered to require simultaneous updates to every memory slot; if accuracy falls below the monolithic baseline, the shielding premise is falsified.

Figures

Figures reproduced from arXiv: 2607.01523 by Jiatong Li, Samuel Yeh, Sharon Li.

**Figure 2.** (a) MCR and MRR of MemAgent [17] (Qwen2.5-14B-Instruct) on RULER-HQA across context lengths from 7K to 896K tokens. While MCR remains stable throughout, MRR degrades sharply with context length, falling below 30% at 896K tokens. (b) MRR and Acc correlation analysis of MemAgent and ReMem [33] on RULER-HQA across context lengths from 7K to 896K tokens. ‘LR’ denotes linear regression. End-to-end accuracy …

**Figure 3.** MRR results across different context lengths. The …

**Figure 4.** Effect of number of memory heads across context lengths. More Memory Heads Improve Retention. We evaluate the impact of the number of heads H on the memory retention rate (MRR). We vary H from 1 to 16 under each context length. As shown in …

**Figure 5.** Accuracy by task type (QA1 – QA10) and context length (128K – 1M) on BABILong. Context lengths: 128K, 256K, 512K, 1M. Task types: QA1 1-supporting fact; QA2 2-supporting facts; QA3 3-supporting facts; QA4 2-arg relations; QA5 3-arg relations; QA6 yes-no questions; QA7 counting; QA8 lists-sets; QA9 simple negation; QA10 indefinite knowledge. …

**Figure 7.** A case study of memory trajectory on BABILong-512K. The task type is QA4: 2-arg …

**Figure 8.** LLM prompt template used in MHM-LRU and MHM-Relevance. Italicized tokens in …

**Figure 9.** LLM prompt template for memory update in MHM-Concur. Italicized tokens in brackets …

**Figure 10.** LLM prompt template for memory selection in MHM-Relevance. Italicized tokens in …

**Figure 11.** Average memory head distance (d_t) with respect to time steps. The higher the d_t, the more semantically diverse the memory heads in the time step.

Original abstract

Recurrent memory agents extend LLMs to arbitrarily long contexts by iteratively consolidating input into a fixed-size memory window. Despite their scalability, these agents exhibit a well-documented reliability problem: end-to-end performance degrades systematically as context length grows. We diagnose this failure by decomposing performance into two factors, memory capture and memory retention, and quantitatively confirm that retention is the dominant bottleneck. Retention collapses because existing designs maintain memory as a monolithic text block, forcing every update to risk overwriting previously retained content. Motivated by this diagnosis, we propose Multi-Head Recurrent Memory (MHM), a general, training-free framework that partitions memory into independent heads governed by a stage-wise select-then-update strategy. At each step, exactly one head is selected for update while the remaining heads are structurally shielded from overwriting, shifting the burden of retention from model behavior to architectural design. As a lightweight instantiation, we introduce Least-Recently-Updated MHM (MHM-LRU), which guarantees uniform head utilization with zero additional token overhead. Extensive experiments on long-context benchmarks show that MHM-LRU substantially improves both retention and end-to-end accuracy across the 100K to 1M token range, where baselines degrade sharply. On RULER-HQA at 896K tokens, MHM-LRU improves the memory retention rate from less than 30% to 73.96%. These gains generalize across model families, scales, and task types, positioning architectural optimization as a practical and cost-efficient path toward reliable long-context recurrent memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MHM splits memory into heads with one-at-a-time updates to fix retention collapse, and the reported jump on RULER-HQA is the main concrete result.

The paper's main contribution is a structural change to recurrent memory agents: instead of treating the memory as one block that risks overwriting on every update, they split it into independent heads and update exactly one per step while shielding the rest. This is presented as a training-free fix motivated by their capture-versus-retention breakdown.

What stands out is the diagnosis itself. They measure the two factors separately and show retention as the dominant failure mode on long contexts. The MHM-LRU version then delivers a clear lift, taking retention from under 30% to about 74% at 896K tokens on RULER-HQA. That number is the strongest evidence they give, and it appears across a few model families.

The soft spot is whether the stage-wise rule actually preserves capture and cross-head reasoning. RULER-HQA stresses single-fact lookup more than coordinated use of multiple heads, so the stress-test concern about possible head starvation or distributed overwrite modes is still open. The abstract does not report error bars or full ablation details on how the selection mechanism behaves when facts are correlated, which leaves the generalization claim thinner than the headline number suggests.

This is for people already working on recurrent or memory-augmented long-context models. The idea is simple enough that a serious referee could check the implementation and run the right follow-up tests in one round. I would send it to review rather than desk-reject; the empirical signal is worth verifying even if the coordination issues turn out to need more work.

Referee Report

1 major / 2 minor

Summary. The paper diagnoses retention as the dominant failure mode in recurrent memory agents for long contexts (due to monolithic memory updates risking overwrites) and proposes Multi-Head Recurrent Memory (MHM), a training-free framework that partitions memory into independent heads with a stage-wise select-then-update rule (exactly one head updated per step). As a concrete instantiation, MHM-LRU uses least-recently-updated selection to guarantee uniform utilization with no extra overhead. Experiments claim that MHM-LRU raises retention from <30% to 73.96% on RULER-HQA at 896K tokens and improves end-to-end accuracy across 100K-1M token regimes on multiple long-context benchmarks, generalizing across model families and scales.

Significance. If the central empirical claims hold, the work supplies a lightweight, parameter-free architectural fix that shifts retention from learned behavior to structural shielding, offering a cost-efficient route to reliable recurrent memory without retraining or added tokens. The diagnosis into capture vs. retention and the zero-overhead LRU instantiation are concrete strengths.

major comments (1)

[Abstract / Experiments] Abstract and Experiments: the claim that gains 'generalize across ... task types' rests on benchmarks whose primary stress (RULER-HQA single-fact retrieval) does not probe the weakest assumption—that the distributed select-then-update rule preserves cross-head consolidation without introducing new failure modes such as head starvation on correlated facts or degraded multi-head reasoning. No additional coordination or multi-fact benchmarks are described that would falsify this risk.

minor comments (2)

[Abstract] Abstract: quantitative claims (73.96% retention, <30% baseline) are reported without error bars, number of runs, or variance, reducing verifiability.
[Abstract] Abstract: full methods details (exact head count, selection implementation, memory window size) are omitted, making the 'lightweight instantiation' claim difficult to reproduce from the given text.
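
One way to attach the uncertainty that the first minor comment asks for is a Wilson score interval on a reported rate. The sketch below is a standard binomial interval, not something the paper reports; the function name `wilson_interval` and the counts in the usage line are hypothetical, since the number of evaluation examples is not given on this page.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 74 retained facts out of 100 probes.
low, high = wilson_interval(74, 100)
```

An interval like this covers only sampling noise over evaluation examples; variance across seeds or runs would need repeated experiments.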

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address the major comment below and agree that additional validation would strengthen the generalization claims.

Referee: [Abstract / Experiments] Abstract and Experiments: the claim that gains 'generalize across ... task types' rests on benchmarks whose primary stress (RULER-HQA single-fact retrieval) does not probe the weakest assumption—that the distributed select-then-update rule preserves cross-head consolidation without introducing new failure modes such as head starvation on correlated facts or degraded multi-head reasoning. No additional coordination or multi-fact benchmarks are described that would falsify this risk.

Authors: We agree that RULER-HQA emphasizes single-fact retrieval and that the current set of benchmarks does not explicitly test scenarios involving correlated facts across heads or multi-fact coordination. The manuscript reports gains on multiple long-context benchmarks, but these do not include dedicated multi-fact or cross-head reasoning evaluations that would directly falsify the identified risks. In the revision we will (1) qualify the generalization statement in the abstract to reflect the task types actually evaluated and (2) add experiments on multi-fact retrieval and coordination benchmarks to probe the select-then-update rule under those conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; design motivated by diagnosis and validated empirically

The paper diagnoses retention failure in monolithic recurrent memory via decomposition into capture/retention factors, then introduces MHM partitioning and stage-wise shielding as an architectural intervention. MHM-LRU is presented as a lightweight instantiation with LRU selection. All performance claims (e.g., retention rate lift on RULER-HQA) rest on external benchmark experiments rather than any equation or parameter fit that reduces the output to the input by construction. No self-citations are invoked as uniqueness theorems or load-bearing premises for the central result. The derivation chain is self-contained against the stated benchmarks and does not exhibit self-definitional, fitted-prediction, or ansatz-smuggling patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the new multi-head memory architecture and the empirical diagnosis of retention as the bottleneck; no free parameters, standard axioms, or invented entities with independent evidence are specified beyond the framework itself.

invented entities (1)

Multi-Head Recurrent Memory heads (no independent evidence)
purpose: Partition memory to enable selective updates that shield retention
New architectural component introduced to address the monolithic memory overwriting problem.

Reference graph

Works this paper leans on

43 extracted references · 10 canonical work pages · 5 internal anchors

[1] Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37:132208–132237, 2024.
[2] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024.
[3] Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
[4] Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16304–16333, 2024.
[5] Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638–119661, 2024.
[6] Minzheng Wang, Longze Chen, Fu Cheng, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5627–5646, 2024.
[7] Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024.
[8] Xinze Li, Yushi Bai, Bowen Jin, Fengbin Zhu, Liangming Pan, and Yixin Cao. Long context vs. rag: Strategies for processing long documents in llms. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4110–4113, 2025.
[9] Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. Resum: Unlocking long-horizon search intelligence via context summarization. arXiv:2509.13313, 2025.
[10] Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap. arXiv:2506.18096, 2025.
[11] Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv:2506.18959, 2025.
[12] Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Iterresearch: Rethinking long-horizon agents via markovian state reconstruction. In The Fourteenth International Conference on Learning Representations, 2026.
[13] Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, et al. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, 2025.
[14] Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. Recurrentgpt: Interactive generation of (arbitrarily) long text. arXiv:2305.13304, 2023.
[15] Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, and Xianpei Han. Memsearcher: Training llms to reason, search and manage memory via end-to-end reinforcement learning. arXiv:2511.02805, 2025.
[16] Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. In The Fourteenth International Conference on Learning Representations, 2026.
[17] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context LLM with multi-conv RL-based memory agent. In The Fourteenth International Conference on Learning Representations, 2026.
[18] Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23281–23298, 2025.
[19] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[20] Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation. In Forty-first International Conference on Machine Learning, 2024.
[21] Tong Wu, Yanpeng Zhao, and Zilong Zheng. An efficient recipe for long context extension via middle-focused positional encoding. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[22] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations, 2025.
[23] Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. LServe: Efficient long-sequence LLM serving with unified sparse attention. In Eighth Conference on Machine Learning and Systems, 2025.
[24] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
[25] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1M technical report ... arXiv, 2025.
[26] Anthropic. Introducing claude sonnet 4.6, 2024.
[27] Google. Gemini 3.1 pro: A smarter model for your most complex tasks, 2026.
[28] Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022.
[29] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023.
[30] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, 2023.
[31] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv:2501.00663, 2024.
[32] Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, and Tat-Seng Chua. When to memorize and when to stop: Gated recurrent memory for long-context reasoning. arXiv:2602.10560, 2026.
[33] Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi GU, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context LLM agents. In The Fourteenth International Conference on Learning Representations, 2026.
[34] Alan Baddeley. Working memory: Theories, models, and controversies. Annual Review of Psychology, 63(1):1–29, 2012.
[35] Norman E Spear. Forgetting as retrieval failure. Animal memory, pages 45–109, 1971.
[36] Laura L Eldridge, Stephen A Engel, Michael M Zeineh, Susan Y Bookheimer, and Barbara J Knowlton. A dissociation of encoding and retrieval processes in the human hippocampus. Journal of Neuroscience, 25(13):3280–3286, 2005.
[37] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu... Qwen2.5 technical report. arXiv, 2024.
[38] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024.
[39] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.
[40] Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. On the existence of a spectrum of policies that subsumes the least recently used (lru) and least frequently used (lfu) policies. In Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 134–143, 1999.
[41] Predrag R Jelenković and Ana Radovanović. Least-recently-used caching with dependent requests. Theoretical Computer Science, 326(1-3):293–327, 2004.
[42] Nimrod Megiddo and Dharmendra S Modha. Outperforming lru with an adaptive replacement cache algorithm. Computer, 37(4):58–65, 2004.
[43] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

[1] [1]

Chain of agents: Large language models collaborating on long-context tasks.Advances in Neural Information Processing Systems, 37:132208–132237, 2024

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of agents: Large language models collaborating on long-context tasks.Advances in Neural Information Processing Systems, 37:132208–132237, 2024

2024

[2] [2]

Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

2024

[3] [3]

How long can context length of open-source llms truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

2023

[4] [4]

Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16304–16333, 2024

2024

[5] [5]

Infllm: Training-free long-context extrapolation for llms with an efficient context memory.Advances in neural information processing systems, 37:119638–119661, 2024

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory.Advances in neural information processing systems, 37:119638–119661, 2024

2024

[6] Minzheng Wang, Longze Chen, Fu Cheng, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context LLMs with extended multi-doc QA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5627–5646, 2024

[7] Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. Advances in Neural Information Processing Systems, 37:95963–96010, 2024

[8] Xinze Li, Yushi Bai, Bowen Jin, Fengbin Zhu, Liangming Pan, and Yixin Cao. Long context vs. RAG: Strategies for processing long documents in LLMs. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4110–4113, 2025

[9] Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. ReSum: Unlocking long-horizon search intelligence via context summarization. arXiv:2509.13313, 2025

[10] Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap. arXiv:2506.18096, 2025

[11] Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv:2506.18959, 2025

[12] Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. IterResearch: Rethinking long-horizon agents via Markovian state reconstruction. In The Fourteenth International Conference on Learning Representations, 2026

[13] Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, et al. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, 2025

[14] Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. RecurrentGPT: Interactive generation of (arbitrarily) long text. arXiv:2305.13304, 2023

[15] Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, and Xianpei Han. MemSearcher: Training LLMs to reason, search and manage memory via end-to-end reinforcement learning. arXiv:2511.02805, 2025

[16] Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. In The Fourteenth International Conference on Learning Representations, 2026

[17] Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. In The Fourteenth International Conference on Learning Representations, 2026

[18] Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23281–23298, 2025

[19] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[20] Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation. In Forty-first International Conference on Machine Learning, 2024

[21] Tong Wu, Yanpeng Zhao, and Zilong Zheng. An efficient recipe for long context extension via middle-focused positional encoding. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[22] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. FlexPrefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations, 2025

[23] Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. LServe: Efficient long-sequence LLM serving with unified sparse attention. In Eighth Conference on Machine Learning and Systems, 2025

[24] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021

[25] Qwen2.5-1M Technical Report

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

arXiv, 2025

[26] Anthropic. Introducing Claude Sonnet 4.6, 2024

[27] Google. Gemini 3.1 Pro: A smarter model for your most complex tasks, 2026

[28] Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022

[29] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752, 2023

[30] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, 2023

[31] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv:2501.00663, 2024

[32] Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, and Tat-Seng Chua. When to memorize and when to stop: Gated recurrent memory for long-context reasoning. arXiv:2602.10560, 2026

[33] Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi GU, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context LLM agents. In The Fourteenth International Conference on Learning Representations, 2026

[34] Alan Baddeley. Working memory: Theories, models, and controversies. Annual Review of Psychology, 63(1):1–29, 2012

[35] Norman E Spear. Forgetting as retrieval failure. Animal Memory, pages 45–109, 1971

[36] Laura L Eldridge, Stephen A Engel, Michael M Zeineh, Susan Y Bookheimer, and Barbara J Knowlton. A dissociation of encoding and retrieval processes in the human hippocampus. Journal of Neuroscience, 25(13):3280–3286, 2005

[37] Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu...

arXiv, 2024

[38] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In First Conference on Language Modeling, 2024

[39] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

[40] Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. On the existence of a spectrum of policies that subsumes the least recently used (LRU) and least frequently used (LFU) policies. In Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 134–143, 1999

[41] Predrag R Jelenković and Ana Radovanović. Least-recently-used caching with dependent requests. Theoretical Computer Science, 326(1-3):293–327, 2004

[42] Nimrod Megiddo and Dharmendra S Modha. Outperforming LRU with an adaptive replacement cache algorithm. Computer, 37(4):58–65, 2004

[43] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

Appendix Table of Contents A Multi-Head Recurre...