pith. sign in

arxiv: 2607.01523 · v1 · pith:LW3A5E4Gnew · submitted 2026-07-01 · 💻 cs.LG · cs.AI· cs.CL

Multi-Head Recurrent Memory Agents

Pith reviewed 2026-07-03 20:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords recurrent memorylong contextLLM agentsmemory retentionmulti-head architecturetraining-free optimization
0
0 comments X

The pith

Splitting recurrent memory into independent heads and updating only one per step raises retention from under 30 percent to 74 percent at 896K tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recurrent memory agents for long-context LLMs lose performance mainly because a single memory block gets overwritten during every update. The paper decomposes the problem into capture versus retention and shows retention is the main failure point. By dividing memory into separate heads and using a select-then-update rule that shields all but one head, the design moves retention protection from model behavior to architecture. A simple least-recently-updated version of this scheme, called MHM-LRU, needs no extra tokens or training yet improves both retention and end-to-end accuracy on benchmarks from 100K to 1M tokens.

Core claim

Existing recurrent memory agents treat memory as one monolithic text block, so every consolidation step risks erasing earlier content and retention collapses with growing context length. Multi-Head Recurrent Memory partitions the fixed-size memory into independent heads and applies a stage-wise select-then-update rule: exactly one head is chosen for the current update while the others stay structurally untouched. The MHM-LRU instantiation enforces uniform head rotation with zero added cost, lifting measured retention on RULER-HQA at 896K tokens from below 30 percent to 73.96 percent and delivering corresponding gains in task accuracy across model families.

What carries the argument

Multi-Head Recurrent Memory (MHM) with stage-wise select-then-update strategy that structurally shields all heads except the single chosen one during each consolidation step.

If this is right

  • End-to-end accuracy on long-context tasks rises in direct proportion to the measured retention rate.
  • The improvement holds across different base models and task types without any retraining.
  • Retention becomes an architectural guarantee rather than an emergent model behavior.
  • Uniform head rotation adds no token overhead yet prevents the degradation seen in single-block memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same head-partition idea could be applied to other fixed-size memory structures such as key-value caches.
  • Optimal head count or selection policy might vary by task and could be tuned without changing the core shielding rule.
  • Uniform rotation may limit specialization of individual heads, which could be checked by tracking per-head contribution on mixed tasks.

Load-bearing premise

That keeping most heads untouched during each update will still let the system capture new information without coordination failures or missed updates across heads.

What would settle it

Measure end-to-end accuracy on a task engineered to require simultaneous updates to every memory slot; if accuracy falls below the monolithic baseline, the shielding premise is falsified.

Figures

Figures reproduced from arXiv: 2607.01523 by Jiatong Li, Samuel Yeh, Sharon Li.

Figure 1
Figure 1. Figure 1: Left: An overview of the multi-head recurrent memory (MHM). Right: An overview of the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) MCR and MRR of MemAgent [17] (Qwen2.5-14B-Instruct) on RULER-HQA across context lengths from 7K to 896K tokens. While MCR remains stable throughout, MRR degrades sharply with context length, falling below 30% at 896K tokens. (b) MRR and Acc correlation analysis of MemAgent and ReMem [33] on RULER-HQA across context lengths from 7K to 896K tokens. The ‘LR’ denotes linear regression. End-to￾end accuracy … view at source ↗
Figure 3
Figure 3. Figure 3: MRR results across different context lengths. The [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of number of memory heads across context lengths. More Memory Heads Improve Retention. We evalu￾ate the impact of the number of heads H on the memory retention rate (MRR). We vary H from 1 to 16, under each context length. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy by task type (QA1 – QA10) and context length (128K – 1M) on BABILong. 128K 256K 512K 1M QA1 1-supporting fact QA2 2-supporting facts QA3 3-supporting facts QA4 2-arg relations QA5 3-arg relations QA6 yes-no questions QA7 counting QA8 lists-sets QA9 simple negation QA10 indefinite knowledge 61.5 64.1 41.0 28.2 9.6 5.1 2.6 0.0 17.3 10.3 5.1 5.1 50.0 22.2 30.6 13.9 86.5 71.8 74.4 48.7 48.1 43.6 28.2 … view at source ↗
Figure 7
Figure 7. Figure 7: A case study of memory trajectory on BABILong-512K. The task type is QA4: 2-arg [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: LLM prompt template used in MHM-LRU and MHM-Relevance. Italicized tokens in [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: LLM prompt template for memory update in MHM-Concur. Italicized tokens in brackets [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: LLM prompt template for memory selection in MHM-Relevance. Italicized tokens in [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Average memory head distance (dt) with respect to time steps. The higher the dt, the more semantically diverse the memory heads in the time step. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
read the original abstract

Recurrent memory agents extend LLMs to arbitrarily long contexts by iteratively consolidating input into a fixed-size memory window. Despite their scalability, these agents exhibit a well-documented reliability problem: end-to-end performance degrades systematically as context length grows. We diagnose this failure by decomposing performance into two factors--memory capture and memory retention--and quantitatively confirm that retention is the dominant bottleneck. Retention collapses because existing designs maintain memory as a monolithic text block, forcing every update to risk overwriting previously retained content. Motivated by this diagnosis, we propose Multi-Head Recurrent Memory (MHM), a general, training-free framework that partitions memory into independent heads governed by a stage-wise select-then-update strategy. At each step, exactly one head is selected for update while the remaining heads are structurally shielded from overwriting, shifting the burden of retention from model behavior to architectural design. As a lightweight instantiation, we introduce Least-Recently-Updated MHM (MHM-LRU), which guarantees uniform head utilization with zero additional token overhead. Extensive experiments on long-context benchmarks show that MHM-LRU substantially improves both retention and end-to-end accuracy across the 100K--1M token range, where baselines degrade sharply. On RULER-HQA at 896K tokens, MHM-LRU improves the memory retention rate from less than 30% to 73.96%. These gains generalize across model families, scales, and task types, positioning architectural optimization as a practical and cost-efficient path toward reliable long-context recurrent memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper diagnoses retention as the dominant failure mode in recurrent memory agents for long contexts (due to monolithic memory updates risking overwrites) and proposes Multi-Head Recurrent Memory (MHM), a training-free framework that partitions memory into independent heads with a stage-wise select-then-update rule (exactly one head updated per step). As a concrete instantiation, MHM-LRU uses least-recently-updated selection to guarantee uniform utilization with no extra overhead. Experiments claim that MHM-LRU raises retention from <30% to 73.96% on RULER-HQA at 896K tokens and improves end-to-end accuracy across 100K-1M token regimes on multiple long-context benchmarks, generalizing across model families and scales.

Significance. If the central empirical claims hold, the work supplies a lightweight, parameter-free architectural fix that shifts retention from learned behavior to structural shielding, offering a cost-efficient route to reliable recurrent memory without retraining or added tokens. The diagnosis into capture vs. retention and the zero-overhead LRU instantiation are concrete strengths.

major comments (1)
  1. [Abstract / Experiments] Abstract and Experiments: the claim that gains 'generalize across ... task types' rests on benchmarks whose primary stress (RULER-HQA single-fact retrieval) does not probe the weakest assumption—that the distributed select-then-update rule preserves cross-head consolidation without introducing new failure modes such as head starvation on correlated facts or degraded multi-head reasoning. No additional coordination or multi-fact benchmarks are described that would falsify this risk.
minor comments (2)
  1. [Abstract] Abstract: quantitative claims (73.96% retention, <30% baseline) are reported without error bars, number of runs, or variance, reducing verifiability.
  2. [Abstract] Abstract: full methods details (exact head count, selection implementation, memory window size) are omitted, making the 'lightweight instantiation' claim difficult to reproduce from the given text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. We address the major comment below and agree that additional validation would strengthen the generalization claims.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments: the claim that gains 'generalize across ... task types' rests on benchmarks whose primary stress (RULER-HQA single-fact retrieval) does not probe the weakest assumption—that the distributed select-then-update rule preserves cross-head consolidation without introducing new failure modes such as head starvation on correlated facts or degraded multi-head reasoning. No additional coordination or multi-fact benchmarks are described that would falsify this risk.

    Authors: We agree that RULER-HQA emphasizes single-fact retrieval and that the current set of benchmarks does not explicitly test scenarios involving correlated facts across heads or multi-fact coordination. The manuscript reports gains on multiple long-context benchmarks, but these do not include dedicated multi-fact or cross-head reasoning evaluations that would directly falsify the identified risks. In the revision we will (1) qualify the generalization statement in the abstract to reflect the task types actually evaluated and (2) add experiments on multi-fact retrieval and coordination benchmarks to probe the select-then-update rule under those conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; design motivated by diagnosis and validated empirically

full rationale

The paper diagnoses retention failure in monolithic recurrent memory via decomposition into capture/retention factors, then introduces MHM partitioning and stage-wise shielding as an architectural intervention. MHM-LRU is presented as a lightweight instantiation with LRU selection. All performance claims (e.g., retention rate lift on RULER-HQA) rest on external benchmark experiments rather than any equation or parameter fit that reduces the output to the input by construction. No self-citations are invoked as uniqueness theorems or load-bearing premises for the central result. The derivation chain is self-contained against the stated benchmarks and does not exhibit self-definitional, fitted-prediction, or ansatz-smuggling patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim rests on the new multi-head memory architecture and the empirical diagnosis of retention as the bottleneck; no free parameters, standard axioms, or invented entities with independent evidence are specified beyond the framework itself.

invented entities (1)
  • Multi-Head Recurrent Memory heads no independent evidence
    purpose: Partition memory to enable selective updates that shield retention
    New architectural component introduced to address the monolithic memory overwriting problem.

pith-pipeline@v0.9.1-grok · 5797 in / 1146 out tokens · 30230 ms · 2026-07-03T20:49:37.494410+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 10 canonical work pages · 5 internal anchors

  1. [1]

    Chain of agents: Large language models collaborating on long-context tasks.Advances in Neural Information Processing Systems, 37:132208–132237, 2024

    Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö Arık. Chain of agents: Large language models collaborating on long-context tasks.Advances in Neural Information Processing Systems, 37:132208–132237, 2024

  2. [2]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

  3. [3]

    How long can context length of open-source llms truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? InNeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

  4. [4]

    Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. Loogle: Can long-context language models understand long contexts? InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16304–16333, 2024

  5. [5]

    Infllm: Training-free long-context extrapolation for llms with an efficient context memory.Advances in neural information processing systems, 37:119638–119661, 2024

    Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory.Advances in neural information processing systems, 37:119638–119661, 2024

  6. [6]

    Leave no document behind: Benchmarking long-context llms with extended multi-doc qa

    Minzheng Wang, Longze Chen, Fu Cheng, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5627–5646, 2024

  7. [7]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024

  8. [8]

    Long context vs

    Xinze Li, Yushi Bai, Bowen Jin, Fengbin Zhu, Liangming Pan, and Yixin Cao. Long context vs. rag: Strategies for processing long documents in llms. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4110– 4113, 2025

  9. [9]

    Re- sum: Unlocking long-horizon search intelligence via context summarization.Arxiv:2509.13313, 2025

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, and Jingren Zhou. Re- sum: Unlocking long-horizon search intelligence via context summarization.Arxiv:2509.13313, 2025

  10. [10]

    arXiv preprint arXiv:2506.18096 , year=

    Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap.Arxiv:2506.18096, 2025

  11. [11]

    From web search towards agentic deep research: Incentivizing search with reasoning agents.Arxiv:2506.18959, 2025

    Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents.Arxiv:2506.18959, 2025

  12. [12]

    Iterresearch: Rethinking long-horizon agents via markovian state reconstruction

    Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Iterresearch: Rethinking long-horizon agents via markovian state reconstruction. InThe Fourteenth International Conference on Learning Representations, 2026. 10

  13. [13]

    In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents

    Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, et al. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, 2025

  14. [14]

    Recurrentgpt: Interactive generation of (arbitrarily) long text,

    Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. Recurrentgpt: Interactive generation of (arbitrarily) long text.Arxiv:2305.13304, 2023

  15. [15]

    MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

    Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, and Xianpei Han. Memsearcher: Training llms to reason, search and manage memory via end-to-end reinforcement learning.Arxiv:2511.02805, 2025

  16. [16]

    MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low, and Paul Pu Liang. MEM1: Learning to synergize memory and reasoning for efficient long-horizon agents. InThe Fourteenth International Conference on Learning Representations, 2026

  17. [17]

    Memagent: Reshaping long-context LLM with multi-conv RL-based memory agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. Memagent: Reshaping long-context LLM with multi-conv RL-based memory agent. InThe Fourteenth International Conference on Learning Representations, 2026

  18. [18]

    Huerta, and Hao Peng

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 23281–23298, 2025

  19. [19]

    The impact of positional encoding on length generalization in transformers

    Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. InThirty-seventh Conference on Neural Information Processing Systems, 2023

  20. [20]

    Two stones hit one bird: Bilevel positional encoding for better length extrapolation

    Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. Two stones hit one bird: Bilevel positional encoding for better length extrapolation. InForty-first International Conference on Machine Learning, 2024

  21. [21]

    An efficient recipe for long context extension via middle-focused positional encoding

    Tong Wu, Yanpeng Zhao, and Zilong Zheng. An efficient recipe for long context extension via middle-focused positional encoding. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  22. [22]

    Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference

    Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context- aware sparse attention mechanism for efficient long-sequence inference. InThe Thirteenth International Conference on Learning Representations, 2025

  23. [23]

    LServe: Efficient long-sequence LLM serving with unified sparse attention

    Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han. LServe: Efficient long-sequence LLM serving with unified sparse attention. InEighth Conference on Machine Learning and Systems, 2025

  24. [24]

    Efficient content-based sparse attention with routing transformers.Transactions of the Association for Computational Linguistics, 9:53–68, 2021

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers.Transactions of the Association for Computational Linguistics, 9:53–68, 2021

  25. [25]

    Qwen2.5-1M Technical Report

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

  26. [26]

    Introducing claude sonnet 4.6, 2024

    Anthropic. Introducing claude sonnet 4.6, 2024

  27. [27]

    Gemini 3.1 pro: A smarter model for your most complex tasks, 2026

    Google. Gemini 3.1 pro: A smarter model for your most complex tasks, 2026

  28. [28]

    Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022

    Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer.Advances in Neural Information Processing Systems, 35:11079–11091, 2022. 11

  29. [29]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. Arxiv:2312.00752, 2023

  30. [30]

    Rwkv: Reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. InFindings of the association for computational linguistics: EMNLP 2023, pages 14048–14077, 2023

  31. [31]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. Arxiv:2501.00663, 2024

  32. [32]

    When to memorize and when to stop: Gated recurrent memory for long-context reasoning.Arxiv:2602.10560, 2026

    Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, and Tat-Seng Chua. When to memorize and when to stop: Gated recurrent memory for long-context reasoning.Arxiv:2602.10560, 2026

  33. [33]

    Look back to reason forward: Revisitable memory for long-context LLM agents

    Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi GU, Xiang Wang, and An Zhang. Look back to reason forward: Revisitable memory for long-context LLM agents. In The Fourteenth International Conference on Learning Representations, 2026

  34. [34]

    Working memory: Theories, models, and controversies.Annual review of psychology, 63(1):1–29, 2012

    Alan Baddeley. Working memory: Theories, models, and controversies.Annual review of psychology, 63(1):1–29, 2012

  35. [35]

    Forgetting as retrieval failure.Animal memory, pages 45–109, 1971

    Norman E Spear. Forgetting as retrieval failure.Animal memory, pages 45–109, 1971

  36. [36]

    A dissociation of encoding and retrieval processes in the human hippocampus

    Laura L Eldridge, Stephen A Engel, Michael M Zeineh, Susan Y Bookheimer, and Barbara J Knowlton. A dissociation of encoding and retrieval processes in the human hippocampus. Journal of Neuroscience, 25(13):3280–3286, 2005

  37. [37]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tian- hao Li, Tingyu...

  38. [38]

    RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

  39. [39]

    Cohen, Ruslan Salakhut- dinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhut- dinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  40. [40]

    On the existence of a spectrum of policies that subsumes the least recently used (lru) and least frequently used (lfu) policies

    Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. On the existence of a spectrum of policies that subsumes the least recently used (lru) and least frequently used (lfu) policies. InProceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 134–143, 1999

  41. [41]

    Least-recently-used caching with dependent requests.Theoretical computer science, 326(1-3):293–327, 2004

    Predrag R Jelenkovi ´c and Ana Radovanovi ´c. Least-recently-used caching with dependent requests.Theoretical computer science, 326(1-3):293–327, 2004

  42. [42]

    Outperforming lru with an adaptive replacement cache algorithm.Computer, 37(4):58–65, 2004

    Nimrod Megiddo and Dharmendra S Modha. Outperforming lru with an adaptive replacement cache algorithm.Computer, 37(4):58–65, 2004

  43. [43]

    There- fore, the answer is (insert answer here)

    Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. 12 Appendix Table of Contents A Multi-Head Recurre...