pith. machine review for the scientific record.

arxiv: 2501.00663 · v1 · submitted 2024-12-31 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, Vahab Mirrokni

Pith reviewed 2026-05-14 22:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords Titans · neural long-term memory · attention mechanisms · long context modeling · recurrent models · language modeling · time series · needle-in-haystack

The pith

Titans combine attention with a learnable neural long-term memory to handle contexts over two million tokens more effectively than Transformers or linear recurrent models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Titans, a family of architectures that add a neural long-term memory module to standard attention. This module learns to memorize and retrieve historical context during both training and inference, allowing the attention mechanism to focus on the current context while still accessing distant past information. The approach aims to combine the accurate dependency modeling of attention with the persistent storage of recurrent-style memory, but with parallelizable training. Experiments on language modeling, common-sense reasoning, genomics, and time series demonstrate superior performance over Transformers and modern linear recurrent models. Titans also maintain higher accuracy when scaling to context windows larger than 2 million tokens in retrieval tasks.

Core claim

Titans introduce a neural long-term memory module that learns to memorize historical context at test time. This module operates alongside attention, which serves as short-term memory for accurate current dependencies, while the neural memory provides persistent long-term storage. The architecture enables fast parallelizable training and fast inference, and three variants show how to incorporate the memory effectively. This results in models that outperform prior approaches on multiple tasks and scale to contexts exceeding 2M tokens with improved needle-in-haystack accuracy.

What carries the argument

The neural long-term memory module, which learns to store and retrieve relevant historical information to complement attention's focus on the current context.
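
As a rough illustration of what "learning to memorize at test time" can mean, the sketch below treats the memory as a small linear map that takes one gradient step per token on a key-value reconstruction loss, with an optional momentum term and decay (forgetting) factor. The class name `LinearMemory`, the loss, and the hyperparameters `lr`, `momentum`, and `decay` are assumptions made for illustration; this page does not reproduce the paper's update rule, and the actual module may be a deeper network with data-dependent gates.

```python
from typing import List

Vector = List[float]


class LinearMemory:
    """Toy associative memory: a d_out x d_in matrix trained online.

    Hypothetical illustration only; not the paper's implementation.
    """

    def __init__(self, d_in: int, d_out: int, lr: float = 0.5,
                 momentum: float = 0.0, decay: float = 0.0) -> None:
        self.W = [[0.0] * d_in for _ in range(d_out)]  # memory weights
        self.S = [[0.0] * d_in for _ in range(d_out)]  # momentum buffer
        self.lr, self.momentum, self.decay = lr, momentum, decay

    def read(self, key: Vector) -> Vector:
        """Retrieve the value currently associated with `key` (W @ key)."""
        return [sum(w * k for w, k in zip(row, key)) for row in self.W]

    def write(self, key: Vector, value: Vector) -> float:
        """Take one gradient step on ||W @ key - value||^2.

        Returns the loss measured before the step.
        """
        err = [p - v for p, v in zip(self.read(key), value)]
        loss = sum(e * e for e in err)
        for i, e in enumerate(err):
            for j, k in enumerate(key):
                grad = 2.0 * e * k
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * grad
                self.W[i][j] = (1.0 - self.decay) * self.W[i][j] + self.S[i][j]
        return loss


# Store two key-value pairs at "test time", then recall the first one.
memory = LinearMemory(d_in=2, d_out=2)
first_loss = memory.write([1.0, 0.0], [3.0, -1.0])
memory.write([0.0, 1.0], [2.0, 5.0])
recalled = memory.read([1.0, 0.0])
```

With orthogonal unit-norm keys and `lr = 0.5`, a single step stores each pair exactly, so `recalled` is `[3.0, -1.0]`. With correlated keys, later writes would partly overwrite earlier ones, which is the kind of interference any forgetting analysis of such a memory has to quantify.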

If this is right

  • Titans outperform Transformers and linear recurrent models on language modeling, common-sense reasoning, genomics, and time series tasks.
  • The models scale effectively to context windows larger than 2 million tokens.
  • Titans achieve higher accuracy in needle-in-haystack tasks at large context sizes compared to baselines.
  • Training remains fast and parallelizable while inference stays fast due to the memory design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The memory module could allow models to handle even longer sequences without increasing the attention window size during training.
  • This approach might generalize to domains like video processing or scientific simulations that require retaining information over very long periods.
  • Future work could explore making the memory module's capacity adaptive based on the task.

Load-bearing premise

The neural memory module can be trained to reliably store and retrieve relevant information from history without catastrophic forgetting or introducing new errors that cancel out the benefits.

What would settle it

A test where Titans show no improvement or worse performance than baselines on long-context needle-in-haystack tasks at scales over 2 million tokens, or exhibit clear signs of memory failure like forgetting key facts.
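
A falsification test of this kind can be organized as a small harness: plant a known fact at a chosen depth in filler text, ask the model to retrieve it, and record exact-match accuracy per context length. The sketch below is hypothetical: `build_haystack`, `needle_accuracy`, and the two stand-in models are names invented here, and a real evaluation would call an actual language model and use the benchmark's own prompts.

```python
from typing import Callable, List

Model = Callable[[List[str], str], str]  # (context tokens, question) -> answer


def build_haystack(length: int, needle: str, depth: float) -> List[str]:
    """Return `length` filler tokens with `needle` planted at relative `depth` in [0, 1]."""
    tokens = ["filler"] * length
    position = min(length - 1, int(depth * length))
    tokens[position] = needle
    return tokens


def needle_accuracy(model: Model, length: int, depths: List[float]) -> float:
    """Fraction of depths at which the model returns the planted needle exactly."""
    hits = 0
    for i, depth in enumerate(depths):
        needle = f"code-{i}"
        context = build_haystack(length, needle, depth)
        if model(context, "What is the code?") == needle:
            hits += 1
    return hits / len(depths)


def full_context_model(context: List[str], question: str) -> str:
    """Stand-in that can see every token (an idealized long-memory model)."""
    return next((t for t in context if t.startswith("code-")), "")


def windowed_model(context: List[str], question: str, window: int = 100) -> str:
    """Stand-in that only sees the last `window` tokens (a fixed-context model)."""
    return full_context_model(context[-window:], question)


depths = [0.0, 0.25, 0.5, 0.75, 1.0]
full_acc = needle_accuracy(full_context_model, 1000, depths)
windowed_acc = needle_accuracy(windowed_model, 1000, depths)
```

In this toy setup the windowed stand-in only recovers the needle planted in the final 100 tokens, so its accuracy drops to one hit in five while the full-context stand-in stays at 1.0. Sweeping `length` and `depths` for a real model and its baselines would produce the accuracy-versus-context curve that the settling test calls for.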

Original abstract

Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.
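
The trade-off the abstract describes can be made concrete by counting operations. The sketch below compares the number of query-key pairs scored by full causal attention with the number of state updates performed by a model that carries a fixed-size memory. The function names are hypothetical, and the counts ignore constant factors such as head dimension and memory size.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full causal attention: token t attends to tokens 1..t."""
    return n_tokens * (n_tokens + 1) // 2


def recurrent_updates(n_tokens: int) -> int:
    """State updates for a fixed-size memory: one per token."""
    return n_tokens


for n in (2_000, 2_000_000):
    ratio = causal_attention_pairs(n) / recurrent_updates(n)
    print(f"n={n:>9,}  attention pairs={causal_attention_pairs(n):>17,}  ratio={ratio:,.1f}")
```

At 2 million tokens the quadratic count is about a million times the linear one, which is consistent with the abstract's motivation for keeping attention on a limited context and leaving older history to a compressed memory.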

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce Titans, a family of architectures combining attention (as short-term memory) with a new neural long-term memory module that learns to memorize and retrieve historical context during test-time inference. Three variants are presented for integrating the memory; experiments on language modeling, commonsense reasoning, genomics, and time series show Titans outperforming Transformers and modern linear recurrent models, with effective scaling to contexts larger than 2M tokens and higher needle-in-haystack accuracy.

Significance. If the empirical results hold under rigorous controls, the work could meaningfully advance efficient long-context modeling by offering a persistent memory mechanism that avoids full quadratic attention while supporting fast inference and generalization beyond training lengths.

major comments (3)
  1. [Experiments] Experiments section: the performance claims lack error bars, ablation studies isolating the neural memory module's contribution, and explicit reporting of training context lengths, which are required to substantiate the >2M scaling result in needle-in-haystack tasks.
  2. [Architecture] Architecture section: the test-time update rule for the neural long-term memory module is specified at a high level without equations or analysis demonstrating stability or resistance to catastrophic forgetting under unsupervised next-token prediction.
  3. [Needle-in-haystack Evaluation] Needle-in-haystack results: the superior accuracy for contexts >2M is presented without detailing baseline implementations, memory state initialization, or controls confirming generalization beyond training lengths.
minor comments (2)
  1. [Abstract] Abstract: the claim of 'fast parallelizable training' would benefit from explicit complexity comparisons (e.g., O(N) vs. O(N^2)) to the cited linear recurrent baselines.
  2. [Introduction] Notation: the distinction between the neural memory hidden state and standard RNN states should be formalized with consistent symbols to improve clarity.
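
For major comment 1, the requested error bars amount to reporting a mean and a spread over independent training seeds. A minimal sketch using Python's standard `statistics` module follows; the function name and the accuracy values are made-up placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the sample mean and the standard error of the mean across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


# Placeholder accuracies from four hypothetical seeds (not real results).
placeholder_scores = [0.5, 0.75, 0.5, 0.75]
mean, stderr = mean_and_stderr(placeholder_scores)
print(f"accuracy = {mean:.3f} +/- {stderr:.3f} (n={len(placeholder_scores)} seeds)")
```

Reporting the number of seeds next to the interval matters here, since a standard error computed from two or three runs is itself quite uncertain.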

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the performance claims lack error bars, ablation studies isolating the neural memory module's contribution, and explicit reporting of training context lengths, which are required to substantiate the >2M scaling result in needle-in-haystack tasks.

    Authors: We agree that error bars, targeted ablations, and explicit training context lengths are necessary to strengthen the empirical claims. In the revised manuscript, we will add error bars computed over multiple random seeds for all reported metrics. We will include new ablation studies that isolate the contribution of the neural long-term memory module (e.g., Titans without the memory module vs. full Titans). We will also explicitly state the training context lengths used for each model and task to support the >2M scaling results.
    revision: yes

  2. Referee: [Architecture] Architecture section: the test-time update rule for the neural long-term memory module is specified at a high level without equations or analysis demonstrating stability or resistance to catastrophic forgetting under unsupervised next-token prediction.

    Authors: We will expand the Architecture section to include the full mathematical formulation of the test-time update rule, including the precise equations governing the memory state evolution. We will add a dedicated subsection providing stability analysis (e.g., bounds on state norms) and empirical evaluations of resistance to catastrophic forgetting, including controlled experiments under unsupervised next-token prediction on long sequences.
    revision: yes

  3. Referee: [Needle-in-haystack Evaluation] Needle-in-haystack results: the superior accuracy for contexts >2M is presented without detailing baseline implementations, memory state initialization, or controls confirming generalization beyond training lengths.

    Authors: We will revise the Needle-in-haystack Evaluation section to provide complete details on baseline implementations (including exact model variants and hyperparameters), memory state initialization procedures at test time, and explicit controls (e.g., training-length-matched vs. extended-context evaluations) that confirm generalization beyond the training context lengths.
    revision: yes
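
The kind of state-norm bound mentioned in response 2 can be illustrated on a simplified update. Assume, purely for illustration (the paper's rule is not reproduced on this page), that a scalar memory state evolves as m_t = (1 - alpha) * m_{t-1} + u_t with a constant decay alpha in (0, 1], writes bounded by |u_t| <= B, and m_0 = 0. A geometric-series argument then gives |m_t| <= B / alpha for every t. The sketch below checks this numerically; all names are hypothetical.

```python
import random


def max_state_magnitude(alpha: float, bound: float, steps: int, seed: int = 0) -> float:
    """Simulate m_t = (1 - alpha) * m_{t-1} + u_t with |u_t| <= bound; return max |m_t|."""
    rng = random.Random(seed)
    state, peak = 0.0, 0.0
    for _ in range(steps):
        update = rng.uniform(-bound, bound)  # bounded write
        state = (1.0 - alpha) * state + update
        peak = max(peak, abs(state))
    return peak


alpha, bound = 0.1, 1.0
peak = max_state_magnitude(alpha, bound, steps=10_000)
limit = bound / alpha  # geometric-series bound: B * sum_k (1 - alpha)^k = B / alpha
```

Under this simplified model the observed `peak` stays below `limit`. With alpha = 0 the same recursion has no such bound, which suggests why a forgetting term is relevant when a memory is updated over millions of tokens.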

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of architecture

full rationale

The paper introduces Titans as a family of architectures combining attention (short-term memory) with a new neural long-term memory module. Central claims concern empirical superiority on language modeling, reasoning, genomics, time-series tasks and scaling beyond 2M context in needle-in-haystack evaluations. No derivation chain, equations, or first-principles predictions appear that reduce to fitted parameters, self-definitions, or self-citation loops. Architecture choices are presented as design decisions tested experimentally rather than derived quantities that collapse to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The architecture introduces a new neural memory module whose update and retrieval rules are not derived from first principles but are learned; the paper relies on standard transformer training assumptions and the empirical claim that the memory can be trained in parallel without instability.

axioms (1)
  • domain assumption: Standard transformer attention and recurrent hidden-state dynamics can be combined with an additional learned memory without introducing unmanageable training instability.
    Invoked when the authors state that the memory enables fast parallelizable training while maintaining fast inference.
invented entities (1)
  • Neural long-term memory module (no independent evidence)
    purpose: To store and retrieve historical context beyond the attention window in a learnable, persistent way.
    The module is presented as a new component that learns to memorize; no independent evidence (e.g., predicted behavior on held-out data outside the reported tasks) is given in the abstract.

pith-pipeline@v0.9.0 · 5548 in / 1446 out tokens · 31186 ms · 2026-05-14T22:03:30.442238+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  2. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  3. Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

    cs.LG 2026-04 unverdicted novelty 7.0

    Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

  4. Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

    cs.CV 2026-04 unverdicted novelty 7.0

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  5. OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

    cs.LG 2026-05 unverdicted novelty 6.0

    OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

  6. Cognifold: Always-On Proactive Memory via Cognitive Folding

    cs.AI 2026-05 unverdicted novelty 6.0

    Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

  7. $\delta$-mem: Efficient Online Memory for Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

  8. A Single-Layer Model Can Do Language Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

  9. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  10. Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    cs.LG 2026-04 conditional novelty 6.0

    Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

  11. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...

  12. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  13. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  14. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  15. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  16. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  17. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

  18. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  19. Gated Delta Networks: Improving Mamba2 with Delta Rule

    cs.CL 2024-12 unverdicted novelty 5.0

    Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.

  20. From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

    cs.AI 2026-05 unverdicted novelty 4.0

    LLM agent memory is organized into Storage (preserving trajectories), Reflection (refining them), and Experience (abstracting into reusable knowledge) stages driven by needs for long-range consistency, dynamic adaptat...

Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · cited by 20 Pith papers · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. “Gpt-4 technical report”. In:arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Linear Transformers with Learnable Kernel Functions are Better In-Context Models

    Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, and Daniil Gavrilov. “Linear Transformers with Learnable Kernel Functions are Better In-Context Models”. In:arXiv preprint arXiv:2402.10644 (2024)

  3. [3]

    Learning to learn by gradient descent by gradient descent

    Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. “Learning to learn by gradient descent by gradient descent”. In:Advances in neural information processing systems 29 (2016)

  4. [4]

    Exploring length generalization in large language models

    Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. “Exploring length generalization in large language models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 38546–38556

  5. [5]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, and Christo- pher Re. “Simple linear attention language models balance the recall-throughput tradeoff”. In:Forty-first International Conference on Machine Learning . 2024. url: https://openreview.net/forum?id=e93ffDcpH3

  6. [6]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014)

  7. [7]

    The Pitfalls of Memo- rization: When Memorization Hurts Generalization

    Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, and Pascal Vincent. “The Pitfalls of Memo- rization: When Memorization Hurts Generalization”. In: arXiv preprint arXiv:2412.07684 (2024)

  8. [8]

    xLSTM: Extended Long Short-Term Memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. “xLSTM: Extended Long Short-Term Memory”. In: arXiv preprint arXiv:2405.04517 (2024)

  9. [9]

    Mambamixer: Efficient selective state space models with dual token and channel selection

    Ali Behrouz, Michele Santacatterina, and Ramin Zabih. “Mambamixer: Efficient selective state space models with dual token and channel selection”. In: arXiv preprint arXiv:2403.19888 (2024)

  10. [10]

    Memory Layers at Scale

    Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, and Gargi Gosh. “Memory Layers at Scale”. In: arXiv preprint arXiv:2412.09764 (2024)

  11. [11]

    Birth of a transformer: A memory viewpoint

    Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. “Birth of a transformer: A memory viewpoint”. In: Advances in Neural Information Processing Systems 36 (2024)

  12. [12]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. “Piqa: Reasoning about physical commonsense in natural language”. In: Proceedings of the AAAI conference on artificial intelligence . Vol. 34. 05. 2020, pp. 7432–7439

  13. [13]

    RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

    Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al. “RecurrentGemma: Moving Past Transformers for Efficient Open Language Models”. In:arXiv preprint arXiv:2404.07839 (2024)

  14. [14]

    Local learning algorithms

    Léon Bottou and Vladimir Vapnik. “Local learning algorithms”. In: Neural computation 4.6 (1992), pp. 888–900

  15. [15]

    Scaling transformer to 1m tokens and beyond with rmt

    Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. “Scaling transformer to 1m tokens and beyond with rmt”. In: arXiv preprint arXiv:2304.11062 (2023)

  16. [16]

    Recurrent memory transformer

    Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. “Recurrent memory transformer”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 11079–11091

  17. [17]

    An Evolved Universal Transformer Memory

    Edoardo Cetin, Qi Sun, Tianyu Zhao, and Yujin Tang. “An Evolved Universal Transformer Memory”. In: arXiv preprint arXiv:2410.13166 (2024)

  18. [18]

    Scatterbrain: Unifying sparse and low-rank attention

    Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. “Scatterbrain: Unifying sparse and low-rank attention”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 17413–17426

  19. [19]

    Rethinking Attention with Performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. “Rethinking Attention with Performers”. In:International Conference on Learning Representations

  20. [20]

    url: https://openreview.net/forum?id=Ua6zuk0WRH

  21. [21]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. “BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions”. In:Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

  22. [22]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. “Think you have solved question answering? try arc, the ai2 reasoning challenge”. In:arXiv preprint arXiv:1803.05457 (2018)

  23. [23]

    What are the differences between long-term, short-term, and working memory?

    Nelson Cowan. “What are the differences between long-term, short-term, and working memory?” In:Progress in brain research 169 (2008), pp. 323–338

  24. [24]

    Transformer- XL: Attentive Language Models beyond a Fixed-Length Context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. “Transformer- XL: Attentive Language Models beyond a Fixed-Length Context”. In: ACL (1). Ed. by Anna Korhonen, David R. Traum, and Lluís Màrquez. Association for Computational Linguistics, 2019, pp. 2978–2988.isbn: 978-1-950737-48-2

  25. [25]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”. In:The Twelfth Inter- national Conference on Learning Representations . 2024. url: https://openreview.net/forum?id=mZn2Xyh9Ec

  26. [26]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. In:Advances in Neural Information Processing Systems . Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 16344–16359. url: https://proceedings.neu...

  27. [27]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality”. In: arXiv preprint arXiv:2405.21060 (2024)

  28. [28]

    Long-term Forecasting with TiDE: Time-series Dense Encoder

    Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. “Long-term Forecasting with TiDE: Time-series Dense Encoder”. In: Transactions on Machine Learning Research (2023). issn: 2835-8856. url: https://openreview.net/forum?id=pCbC3aQB5W

  29. [29]

    Griffin: Mixing gated linear recurrences with local attention for efficient language models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. “Griffin: Mixing gated linear recurrences with local attention for efficient language models”. In:arXiv preprint arXiv:2402.19427 (2024)

  30. [30]

    Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. “Flex Attention: A Programming Model for Generating Optimized Attention Kernels”. In: arXiv preprint arXiv:2412.05496 (2024)

  31. [31]

    Hymba: A Hybrid-head Architecture for Small Language Models

    Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. “Hymba: A Hybrid-head Architecture for Small Language Models”. In: arXiv preprint arXiv:2411.13676 (2024)

  32. [32]

    Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. “Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning”. In: Neural networks 107 (2018), pp. 3–11

  33. [33]

    Learn to remember: Transformer with recurrent memory for document-level machine translation

    Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, and Philipp Koehn. “Learn to remember: Transformer with recurrent memory for document-level machine translation”. In: arXiv preprint arXiv:2205.01546 (2022)

  34. [34]

    Hungry Hungry Hippos: Towards Language Modeling with State Space Models

    Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. “Hungry Hungry Hippos: Towards Language Modeling with State Space Models”. In:The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=COZDy0WYGg

  35. [35]

    Test-time training with masked autoencoders

    Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. “Test-time training with masked autoencoders”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 29374–29385

  36. [36]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. “The pile: An 800gb dataset of diverse text for language modeling”. In:arXiv preprint arXiv:2101.00027 (2020)

  37. [37]

    Learning to forget: Continual prediction with LSTM

    Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM”. In: Neural computation 12.10 (2000), pp. 2451–2471

  38. [38]

    Neural Turing Machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines . 2014. arXiv: 1410.5401 [cs.NE] . url: https://arxiv.org/abs/1410.5401

  39. [39]

    LSTM: A search space odyssey

    Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. “LSTM: A search space odyssey”. In: IEEE transactions on neural networks and learning systems 28.10 (2016), pp. 2222–2232

  40. [40]

    Genomic benchmarks: a collection of datasets for genomic sequence classification

    Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, and Panagiotis Alexiou. “Genomic benchmarks: a collection of datasets for genomic sequence classification”. In: BMC Genomic Data 24.1 (2023), p. 25

  41. [41]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”. In:First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=tEYskw1VY2

  42. [42]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Re. “Efficiently Modeling Long Sequences with Structured State Spaces”. In: International Conference on Learning Representations . 2022. url: https : / / openreview . net / forum ? id = uYLFoz1vlAC. 19

  43. [43]

    LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. “LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models”. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Ed. by Kevin Duh, Helen...

  44. [44]

    Liquid Structural State-Space Models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. “Liquid Structural State-Space Models”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=g4OTKRKfS7R

  45. [45]

    CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

    Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, and Rogerio Feris. “CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory”. In: arXiv preprint arXiv:2402.13449 (2024)

  46. [46]

    The organization of behavior: A neuropsychological theory

    Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology press, 2005

  47. [47]

    Neural networks and physical systems with emergent collective computational abilities

    John J Hopfield. “Neural networks and physical systems with emergent collective computational abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558

  48. [48]

    Multilayer feedforward networks are universal approximators

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators”. In: Neural networks 2.5 (1989), pp. 359–366

  49. [49]

    RULER: What’s the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=kIoBbc76Sy

  50. [50]

    Block-recurrent transformers

    DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. “Block-recurrent transformers”. In: Advances in neural information processing systems 35 (2022), pp. 33248–33261

  51. [51]

    The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention

    Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. “The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention”. In: International Conference on Machine Learning. PMLR. 2022, pp. 9639–9659

  52. [52]

    Going beyond linear transformers with recurrent fast weight programmers

    Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. “Going beyond linear transformers with recurrent fast weight programmers”. In: Advances in neural information processing systems 34 (2021), pp. 7703–7717

  53. [53]

    Online domain adaptation of a pre-trained cascade of classifiers

    Vidit Jain and Erik Learned-Miller. “Online domain adaptation of a pre-trained cascade of classifiers”. In: CVPR 2011. IEEE. 2011, pp. 577–584

  55. [55]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023)

  56. [56]

    PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels

    Praneeth Kacham, Vahab Mirrokni, and Peilin Zhong. “PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=ghYrfdJfjK

  57. [57]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling laws for neural language models”. In: arXiv preprint arXiv:2001.08361 (2020)

  58. [58]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. “Transformers are rnns: Fast autoregressive transformers with linear attention”. In: International conference on machine learning. PMLR. 2020, pp. 5156–5165

  59. [59]

    Generalization through Memorization: Nearest Neighbor Language Models

    Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. “Generalization through Memorization: Nearest Neighbor Language Models”. In: International Conference on Learning Representations. 2020. url: https://openreview.net/forum?id=HklBjCEKvH

  60. [60]

    BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

    Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack”. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024. url: https://openreview.net/forum?id=u7m2CG84BQ

  61. [61]

    Self-attentive associative memory

    Hung Le, Truyen Tran, and Svetha Venkatesh. “Self-attentive associative memory”. In: International conference on machine learning. PMLR. 2020, pp. 5682–5691

  62. [62]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. “Retrieval-augmented generation for knowledge-intensive nlp tasks”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474

  63. [63]

    Learning, Forgetting, Remembering: Insights From Tracking LLM Memorization During Training

    Danny Leybzon and Corentin Kervadec. “Learning, Forgetting, Remembering: Insights From Tracking LLM Memorization During Training”. In: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024, pp. 43–57

  64. [64]

    Revisiting long-term time series forecasting: An investigation on linear mapping

    Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. “Revisiting long-term time series forecasting: An investigation on linear mapping”. In: arXiv preprint arXiv:2305.10721 (2023)

  65. [65]

    Longhorn: State space models are amortized online learners

    Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. “Longhorn: State space models are amortized online learners”. In: arXiv preprint arXiv:2407.14207 (2024)

  66. [66]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. “Lost in the middle: How language models use long contexts”. In: Transactions of the Association for Computational Linguistics 12 (2024), pp. 157–173

  67. [67]

    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

    Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. “iTransformer: Inverted Transformers Are Effective for Time Series Forecasting”. In: arXiv preprint arXiv:2310.06625 (2023)

  68. [68]

    The structure of value: Accounting for taste

    George Mandler. “The structure of value: Accounting for taste”. In: Affect and cognition. Psychology Press, 2014, pp. 3–36

  69. [69]

    Long Range Language Modeling via Gated State Spaces

    Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. “Long Range Language Modeling via Gated State Spaces”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=5MkYIYCbva

  70. [70]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. “Pointer Sentinel Mixture Models”. In: International Conference on Learning Representations. 2017. url: https://openreview.net/forum?id=Byj72udxe

  71. [71]

    The Illusion of State in State-Space Models

    William Merrill, Jackson Petty, and Ashish Sabharwal. “The Illusion of State in State-Space Models”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=QZgo9JZpLq

  72. [72]

    Online model distillation for efficient video inference

    Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. “Online model distillation for efficient video inference”. In: Proceedings of the IEEE/CVF International conference on computer vision. 2019, pp. 3573–3582

  73. [73]

    Leave no context behind: Efficient infinite context transformers with infini-attention

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. “Leave no context behind: Efficient infinite context transformers with infini-attention”. In: arXiv preprint arXiv:2404.07143 (2024)

  74. [74]

    Metalearned neural memory

    Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, and Adam Trischler. “Metalearned neural memory”. In: Advances in Neural Information Processing Systems 32 (2019)

  75. [75]

    Neural semantic encoders

    Tsendsuren Munkhdalai and Hong Yu. “Neural semantic encoders”. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. Vol. 1. NIH Public Access. 2017, p. 397

  76. [76]

    Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution

    Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. “Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution”. In: Advances in neural information processing systems 36 (2024)

  77. [77]

    On First-Order Meta-Learning Algorithms

    A Nichol. “On First-Order Meta-Learning Algorithms”. In: arXiv preprint arXiv:1803.02999 (2018)

  78. [78]

    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: Long-term forecasting with transformers”. In: arXiv preprint arXiv:2211.14730 (2022)

  79. [79]

    Learning and memory

    Hideyuki Okano, Tomoo Hirano, and Evan Balaban. “Learning and memory”. In: Proceedings of the National Academy of Sciences 97.23 (2000), pp. 12403–12404

  80. [80]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. “Resurrecting recurrent neural networks for long sequences”. In: International Conference on Machine Learning. PMLR. 2023, pp. 26670–26698

Showing first 80 references.