arxiv: 2501.00663 · v1 · submitted 2024-12-31 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, Vahab Mirrokni

Pith reviewed 2026-05-14 22:03 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords Titans · neural long-term memory · attention mechanisms · long context modeling · recurrent models · language modeling · time series · needle-in-haystack

The pith

Titans combine attention with a learnable neural long-term memory to handle contexts over two million tokens more effectively than Transformers or linear recurrent models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Titans, a family of architectures that add a neural long-term memory module to standard attention. This module learns to memorize and retrieve historical context during both training and inference, allowing the attention mechanism to focus on the current context while still accessing distant past information. The approach aims to combine the accurate dependency modeling of attention with the persistent storage of recurrent-style memory, but with parallelizable training. Experiments on language modeling, common-sense reasoning, genomics, and time series demonstrate superior performance over Transformers and modern linear recurrent models. Titans also maintain higher accuracy when scaling to context windows larger than 2 million tokens in retrieval tasks.

Core claim

Titans introduce a neural long-term memory module that learns to memorize historical context at test time. This module operates alongside attention, which serves as short-term memory for accurate current dependencies, while the neural memory provides persistent long-term storage. The architecture enables fast parallelizable training and fast inference, and three variants show how to incorporate the memory effectively. This results in models that outperform prior approaches on multiple tasks and scale to contexts exceeding 2M tokens with improved needle-in-haystack accuracy.

What carries the argument

The neural long-term memory module, which learns to store and retrieve relevant historical information to complement attention's focus on the current context.

If this is right

Titans outperform Transformers and linear recurrent models on language modeling, common-sense reasoning, genomics, and time series tasks.
The models scale effectively to context windows larger than 2 million tokens.
Titans achieve higher accuracy in needle-in-haystack tasks at large context sizes compared to baselines.
Training remains fast and parallelizable while inference stays fast due to the memory design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The memory module could allow models to handle even longer sequences without increasing the attention window size during training.
This approach might generalize to domains like video processing or scientific simulations that require retaining information over very long periods.
Future work could explore making the memory module's capacity adaptive based on the task.

Load-bearing premise

The neural memory module can be trained to reliably store and retrieve relevant information from history without catastrophic forgetting or introducing new errors that cancel out the benefits.

What would settle it

A test where Titans show no improvement or worse performance than baselines on long-context needle-in-haystack tasks at scales over 2 million tokens, or exhibit clear signs of memory failure like forgetting key facts.
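One way to make this test concrete is a small harness that hides a random "needle" in filler text and checks retrieval at several context lengths. The sketch below is illustrative only: `build_haystack`, `needle_accuracy`, and both reader functions are hypothetical names, haystack size is counted in filler sentences rather than tokens, and the readers are toy stand-ins for real models, not the paper's evaluation setup.

```python
import random

MARKER = "The magic number is "

def build_haystack(needle, n_filler, position, filler="The sky was grey that day."):
    """Return a list of sentences with `needle` inserted at index `position`."""
    sentences = [filler] * n_filler
    sentences.insert(position, needle)
    return sentences

def needle_accuracy(answer_fn, lengths, trials=20, seed=0):
    """Per haystack length, the fraction of trials where answer_fn returns the secret."""
    rng = random.Random(seed)
    results = {}
    for n in lengths:
        hits = 0
        for _ in range(trials):
            secret = str(rng.randint(100000, 999999))  # always six digits
            position = rng.randint(0, n)               # needle depth in the haystack
            context = " ".join(build_haystack(f"{MARKER}{secret}.", n, position))
            if answer_fn(context, "What is the magic number?") == secret:
                hits += 1
        results[n] = hits / trials
    return results

def perfect_reader(context, question):
    """Toy stand-in (assumption): a model that sees the whole context."""
    start = context.index(MARKER) + len(MARKER)
    return context[start:start + 6]

def truncating_reader(context, question, window=2000):
    """Toy stand-in (assumption): a model that sees only the last `window` characters."""
    tail = context[-window:]
    if MARKER not in tail:
        return ""
    start = tail.index(MARKER) + len(MARKER)
    return tail[start:start + 6]
```

Replacing the toy readers with calls to the models under comparison, and sweeping `lengths` past the training context length, would show whether accuracy degrades with depth or scale in the way the falsification criterion describes.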

Original abstract

Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Titans adds a learnable neural long-term memory module alongside attention, with reported gains on multiple tasks and scaling past 2M tokens, though stability of the unsupervised memory updates remains an open question.

Colleague, the main thing to know is that this paper introduces a neural memory module meant to act as persistent long-term storage while attention handles short-term dependencies. It offers three integration variants and claims better performance than Transformers and linear recurrent models on language modeling, reasoning, genomics, and time series, plus stronger needle-in-haystack results beyond 2M tokens. The parallel training and fast inference are practical upsides if they deliver as described. The framing of the two modules as complementary memory types is a clean way to motivate the design, and the breadth of tasks gives the empirical side some reach.

The soft spots center on whether the memory module actually retains and retrieves useful history reliably under test-time updates. Standard next-token training gives only indirect pressure, so collapse or forgetting on long sequences is a real possibility that the high-level results do not yet rule out. Without detailed ablations on memory capacity, update mechanics, and retention over extended contexts, the scaling claim rests on comparisons that could shift with tighter controls.

This is aimed at people working on long-context architectures who want concrete alternatives to pure attention scaling. A reader looking for new module ideas will find something worth examining even if the numbers need verification. I would send it for peer review to get the experimental details and stability checks properly examined.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce Titans, a family of architectures combining attention (as short-term memory) with a new neural long-term memory module that learns to memorize and retrieve historical context during test-time inference. Three variants are presented for integrating the memory; experiments on language modeling, commonsense reasoning, genomics, and time series show Titans outperforming Transformers and modern linear recurrent models, with effective scaling to contexts larger than 2M tokens and higher needle-in-haystack accuracy.

Significance. If the empirical results hold under rigorous controls, the work could meaningfully advance efficient long-context modeling by offering a persistent memory mechanism that avoids full quadratic attention while supporting fast inference and generalization beyond training lengths.

major comments (3)

[Experiments] Experiments section: the performance claims lack error bars, ablation studies isolating the neural memory module's contribution, and explicit reporting of training context lengths, which are required to substantiate the >2M scaling result in needle-in-haystack tasks.
[Architecture] Architecture section: the test-time update rule for the neural long-term memory module is specified at a high level without equations or analysis demonstrating stability or resistance to catastrophic forgetting under unsupervised next-token prediction.
[Needle-in-haystack Evaluation] Needle-in-haystack results: the superior accuracy for contexts >2M is presented without detailing baseline implementations, memory state initialization, or controls confirming generalization beyond training lengths.
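The reviewed summary gives no equations for the update rule, so the following is only a sketch of what a gradient-based test-time memory update can look like, assuming a linear associative memory, a squared-error loss, and momentum plus decay terms. The names `memory_step`, `theta`, `eta`, and `alpha` are illustrative assumptions and should not be read as the paper's formulation.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def memory_step(M, S, k, v, theta=0.1, eta=0.5, alpha=0.01):
    """One test-time update of a linear memory M on a (key, value) pair.

    Assumed loss: ||M k - v||^2, whose gradient w.r.t. M is 2 (M k - v) k^T.
    S is a momentum buffer, eta its decay, theta the step size,
    and alpha a forgetting rate applied to the stored memory.
    """
    err = [p - t for p, t in zip(matvec(M, k), v)]  # prediction error M k - v
    rows, cols = len(v), len(k)
    new_S = [[eta * S[i][j] - theta * 2.0 * err[i] * k[j] for j in range(cols)]
             for i in range(rows)]
    new_M = [[(1.0 - alpha) * M[i][j] + new_S[i][j] for j in range(cols)]
             for i in range(rows)]
    return new_M, new_S

def frobenius_norm(M):
    """Size of the memory state, useful as a cheap stability diagnostic."""
    return sum(x * x for row in M for x in row) ** 0.5

# Usage: repeatedly present one association and watch the memory absorb it.
key, value = [1.0, 0.0, 0.0], [2.0, -1.0]
M = [[0.0] * 3 for _ in range(2)]
S = [[0.0] * 3 for _ in range(2)]
for _ in range(50):
    M, S = memory_step(M, S, key, value, alpha=0.0)
recalled = matvec(M, key)  # close to `value` once the updates have converged
```

Logging `frobenius_norm(M)` over a long input stream is one simple way to probe the stability concern raised above: a norm that grows without bound, or recall error that rises for early associations as later ones arrive, would indicate the kind of failure the comment asks the authors to rule out.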

minor comments (2)

[Abstract] Abstract: the claim of 'fast parallelizable training' would benefit from explicit complexity comparisons (e.g., O(N) vs. O(N^2)) to the cited linear recurrent baselines.
[Introduction] Notation: the distinction between the neural memory hidden state and standard RNN states should be formalized with consistent symbols to improve clarity.
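To illustrate the kind of complexity comparison the first minor comment requests, the sketch below counts token-level operations for three access patterns. It is a back-of-the-envelope illustration, not a measurement: the function names are hypothetical, and hidden dimensions, constant factors, and hardware effects are all ignored.

```python
def attention_pair_count(n_tokens):
    """Causal full attention: token t is scored against tokens 1..t, so O(N^2)."""
    return n_tokens * (n_tokens + 1) // 2

def windowed_attention_pair_count(n_tokens, window):
    """Sliding-window attention: each token is scored against at most `window` tokens."""
    if window >= n_tokens:
        return attention_pair_count(n_tokens)
    # The first `window` tokens see a growing prefix; the rest see a full window.
    return window * (window + 1) // 2 + (n_tokens - window) * window

def recurrent_step_count(n_tokens):
    """Fixed-size recurrent state: one state update per token, so O(N)."""
    return n_tokens

n = 2_000_000
full = attention_pair_count(n)                     # 2,000,001,000,000 pairs
local = windowed_attention_pair_count(n, 4096)     # linear in n for a fixed window
steps = recurrent_step_count(n)                    # 2,000,000 updates
```

At two million tokens the full-attention count is about a million times larger than the recurrent count, which is the gap an explicit complexity table in the paper would make visible.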

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.

Referee: [Experiments] Experiments section: the performance claims lack error bars, ablation studies isolating the neural memory module's contribution, and explicit reporting of training context lengths, which are required to substantiate the >2M scaling result in needle-in-haystack tasks.

Authors: We agree that error bars, targeted ablations, and explicit training context lengths are necessary to strengthen the empirical claims. In the revised manuscript, we will add error bars computed over multiple random seeds for all reported metrics. We will include new ablation studies that isolate the contribution of the neural long-term memory module (e.g., Titans without the memory module vs. full Titans). We will also explicitly state the training context lengths used for each model and task to support the >2M scaling results.

revision: yes

Referee: [Architecture] Architecture section: the test-time update rule for the neural long-term memory module is specified at a high level without equations or analysis demonstrating stability or resistance to catastrophic forgetting under unsupervised next-token prediction.

Authors: We will expand the Architecture section to include the full mathematical formulation of the test-time update rule, including the precise equations governing the memory state evolution. We will add a dedicated subsection providing stability analysis (e.g., bounds on state norms) and empirical evaluations of resistance to catastrophic forgetting, including controlled experiments under unsupervised next-token prediction on long sequences.

revision: yes

Referee: [Needle-in-haystack Evaluation] Needle-in-haystack results: the superior accuracy for contexts >2M is presented without detailing baseline implementations, memory state initialization, or controls confirming generalization beyond training lengths.

Authors: We will revise the Needle-in-haystack Evaluation section to provide complete details on baseline implementations (including exact model variants and hyperparameters), memory state initialization procedures at test time, and explicit controls (e.g., training-length-matched vs. extended-context evaluations) that confirm generalization beyond the training context lengths.

revision: yes
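The error bars and ablation comparison promised in the first response can be computed with very little machinery. The sketch below is a minimal illustration using only the Python standard library; `summarize_runs` and `ablation_gap` are hypothetical helper names, and the inputs are assumed to be one scalar metric per random seed.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Mean, sample standard deviation, and standard error across seeds."""
    m = mean(scores)
    s = stdev(scores) if len(scores) > 1 else 0.0  # stdev needs at least two runs
    sem = s / len(scores) ** 0.5
    return {"mean": m, "std": s, "sem": sem}

def ablation_gap(full_scores, ablated_scores):
    """Difference in mean metric between the full model and an ablated variant,
    together with the combined standard error of that difference."""
    full = summarize_runs(full_scores)
    ablated = summarize_runs(ablated_scores)
    gap = full["mean"] - ablated["mean"]
    combined_sem = (full["sem"] ** 2 + ablated["sem"] ** 2) ** 0.5
    return gap, combined_sem
```

A gap that is small relative to its combined standard error would suggest that the memory module's contribution cannot be separated from seed-to-seed noise, which is the situation the referee's request is designed to detect.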

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation of architecture

The paper introduces Titans as a family of architectures combining attention (short-term memory) with a new neural long-term memory module. Central claims concern empirical superiority on language modeling, reasoning, genomics, time-series tasks and scaling beyond 2M context in needle-in-haystack evaluations. No derivation chain, equations, or first-principles predictions appear that reduce to fitted parameters, self-definitions, or self-citation loops. Architecture choices are presented as design decisions tested experimentally rather than derived quantities that collapse to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The architecture introduces a new neural memory module whose update and retrieval rules are not derived from first principles but are learned; the paper relies on standard transformer training assumptions and the empirical claim that the memory can be trained in parallel without instability.

axioms (1)

domain assumption: Standard Transformer attention and recurrent hidden-state dynamics can be combined with an additional learned memory without introducing unmanageable training instability.
Invoked when the authors state that the memory enables fast parallelizable training while maintaining fast inference.

invented entities (1)

Neural long-term memory module · no independent evidence
purpose: To store and retrieve historical context beyond the attention window in a learnable, persistent way.
The module is presented as a new component that learns to memorize; no independent evidence (e.g., predicted behavior on held-out data outside the reported tasks) is given in the abstract.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WriteSAE: Sparse Autoencoders for Recurrent State
cs.LG · 2026-05 · unverdicted · novelty 8.0

WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG · 2026-05 · unverdicted · novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
cs.LG · 2026-04 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
cs.CV · 2026-04 · unverdicted · novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
cs.LG · 2026-05 · unverdicted · novelty 6.0

OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

Cognifold: Always-On Proactive Memory via Cognitive Folding
cs.AI · 2026-05 · unverdicted · novelty 6.0

Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...

$\delta$-mem: Efficient Online Memory for Large Language Models
cs.AI · 2026-05 · unverdicted · novelty 6.0

δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

A Single-Layer Model Can Do Language Modeling
cs.CL · 2026-05 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

The Impossibility Triangle of Long-Context Modeling
cs.CL · 2026-05 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
cs.LG · 2026-04 · conditional · novelty 6.0

Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
cs.AI · 2026-04 · unverdicted · novelty 6.0

LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...

DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
cs.CV · 2026-04 · unverdicted · novelty 6.0

CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV · 2026-04 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

In-Place Test-Time Training
cs.LG · 2026-04 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
cs.CL · 2025-06 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
cs.CL · 2026-05 · unverdicted · novelty 5.0

Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
cs.DC · 2026-03 · unverdicted · novelty 5.0

Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI · 2025-03 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

Gated Delta Networks: Improving Mamba2 with Delta Rule
cs.CL · 2024-12 · unverdicted · novelty 5.0

Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms
cs.AI · 2026-05 · unverdicted · novelty 4.0

LLM agent memory is organized into Storage (preserving trajectories), Reflection (refining them), and Experience (abstracting into reusable knowledge) stages driven by needs for long-range consistency, dynamic adaptat...

Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · cited by 20 Pith papers · 20 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. “Gpt-4 technical report”. In:arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Linear Transformers with Learnable Kernel Functions are Better In-Context Models

Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, and Daniil Gavrilov. “Linear Transformers with Learnable Kernel Functions are Better In-Context Models”. In:arXiv preprint arXiv:2402.10644 (2024)

work page arXiv 2024
[3]

Learning to learn by gradient descent by gradient descent

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. “Learning to learn by gradient descent by gradient descent”. In:Advances in neural information processing systems 29 (2016)

work page 2016
[4]

Exploring length generalization in large language models

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. “Exploring length generalization in large language models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 38546–38556

work page 2022
[5]

Simple linear attention language models balance the recall-throughput tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, and Christo- pher Re. “Simple linear attention language models balance the recall-throughput tradeoff”. In:Forty-first International Conference on Machine Learning . 2024. url: https://openreview.net/forum?id=e93ffDcpH3

work page 2024
[6]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[7]

The Pitfalls of Memo- rization: When Memorization Hurts Generalization

Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, and Pascal Vincent. “The Pitfalls of Memo- rization: When Memorization Hurts Generalization”. In: arXiv preprint arXiv:2412.07684 (2024)

work page arXiv 2024
[8]

xLSTM: Extended Long Short-Term Memory

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. “xLSTM: Extended Long Short-Term Memory”. In: arXiv preprint arXiv:2405.04517 (2024)

work page arXiv 2024
[9]

Mambamixer: Efficient selective state space models with dual token and channel selection

Ali Behrouz, Michele Santacatterina, and Ramin Zabih. “Mambamixer: Efficient selective state space models with dual token and channel selection”. In: arXiv preprint arXiv:2403.19888 (2024)

work page arXiv 2024
[10]

Memory Layers at Scale

Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, and Gargi Gosh. “Memory Layers at Scale”. In: arXiv preprint arXiv:2412.09764 (2024)

work page arXiv 2024
[11]

Birth of a transformer: A memory viewpoint

Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. “Birth of a transformer: A memory viewpoint”. In: Advances in Neural Information Processing Systems 36 (2024)

work page 2024
[12]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. “Piqa: Reasoning about physical commonsense in natural language”. In: Proceedings of the AAAI conference on artificial intelligence . Vol. 34. 05. 2020, pp. 7432–7439

work page 2020
[13]

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al. “RecurrentGemma: Moving Past Transformers for Efficient Open Language Models”. In:arXiv preprint arXiv:2404.07839 (2024)

work page arXiv 2024
[14]

Local learning algorithms

Léon Bottou and Vladimir Vapnik. “Local learning algorithms”. In: Neural computation 4.6 (1992), pp. 888–900

work page 1992
[15]

Scaling transformer to 1m tokens and beyond with rmt

Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. “Scaling transformer to 1m tokens and beyond with rmt”. In: arXiv preprint arXiv:2304.11062 (2023)

work page arXiv 2023
[16]

Recurrent memory transformer

Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. “Recurrent memory transformer”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 11079–11091

work page 2022
[17]

An Evolved Universal Transformer Memory

Edoardo Cetin, Qi Sun, Tianyu Zhao, and Yujin Tang. “An Evolved Universal Transformer Memory”. In: arXiv preprint arXiv:2410.13166 (2024)

work page arXiv 2024
[18]

Scatterbrain: Unifying sparse and low-rank attention

Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. “Scatterbrain: Unifying sparse and low-rank attention”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 17413–17426

work page 2021
[19]

Rethinking Attention with Performers

Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. “Rethinking Attention with Performers”. In:International Conference on Learning Representations

work page
[20]

url: https://openreview.net/forum?id=Ua6zuk0WRH

work page
[21]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. “BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions”. In:Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

work page doi:10.18653/v1/n19-1300 2019
[22]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. “Think you have solved question answering? try arc, the ai2 reasoning challenge”. In:arXiv preprint arXiv:1803.05457 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

What are the differences between long-term, short-term, and working memory?

Nelson Cowan. “What are the differences between long-term, short-term, and working memory?” In:Progress in brain research 169 (2008), pp. 323–338

work page 2008
[24]

Transformer- XL: Attentive Language Models beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. “Transformer- XL: Attentive Language Models beyond a Fixed-Length Context”. In: ACL (1). Ed. by Anna Korhonen, David R. Traum, and Lluís Màrquez. Association for Computational Linguistics, 2019, pp. 2978–2988.isbn: 978-1-950737-48-2

work page 2019
[25]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”. In:The Twelfth Inter- national Conference on Learning Representations . 2024. url: https://openreview.net/forum?id=mZn2Xyh9Ec

work page 2024
[26]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. In:Advances in Neural Information Processing Systems . Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 16344–16359. url: https://proceedings.neu...

work page 2022
[27]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality”. In: arXiv preprint arXiv:2405.21060 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Long-term Forecasting with TiDE: Time-series Dense Encoder

Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. “Long-term Forecasting with TiDE: Time-series Dense Encoder”. In: Transactions on Machine Learning Research (2023). issn: 2835-8856. url: https://openreview.net/forum?id=pCbC3aQB5W

work page 2023
[29]

Griffin: Mixing gated linear recurrences with local attention for efficient language models

Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. “Griffin: Mixing gated linear recurrences with local attention for efficient language models”. In:arXiv preprint arXiv:2402.19427 (2024)

work page arXiv 2024
[30]

Flex Attention: A Programming Model for Generating Optimized Attention Kernels

Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. “Flex Attention: A Programming Model for Generating Optimized Attention Kernels”. In: arXiv preprint arXiv:2412.05496 (2024)

work page arXiv 2024
[31]

Hymba: A Hybrid-head Architecture for Small Language Models

Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. “Hymba: A Hybrid-head Architecture for Small Language Models”. In: arXiv preprint arXiv:2411.13676 (2024)

work page arXiv 2024
[32]

Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. “Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning”. In: Neural networks 107 (2018), pp. 3–11

work page 2018
[33]

Learn to remember: Transformer with recurrent memory for document-level machine translation

Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, and Philipp Koehn. “Learn to remember: Transformer with recurrent memory for document-level machine translation”. In: arXiv preprint arXiv:2205.01546 (2022)

work page arXiv 2022
[34]

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. “Hungry Hungry Hippos: Towards Language Modeling with State Space Models”. In:The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=COZDy0WYGg

work page 2023
[35]

Test-time training with masked autoencoders

Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. “Test-time training with masked autoencoders”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 29374–29385

work page 2022
[36]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. “The pile: An 800gb dataset of diverse text for language modeling”. In:arXiv preprint arXiv:2101.00027 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020
[37]

Learning to forget: Continual prediction with LSTM

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM”. In: Neural computation 12.10 (2000), pp. 2451–2471

[38]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. 2014. arXiv: 1410.5401 [cs.NE]. url: https://arxiv.org/abs/1410.5401

[39]

LSTM: A search space odyssey

Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. “LSTM: A search space odyssey”. In: IEEE transactions on neural networks and learning systems 28.10 (2016), pp. 2222–2232

[40]

Genomic benchmarks: a collection of datasets for genomic sequence classification

Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, and Panagiotis Alexiou. “Genomic benchmarks: a collection of datasets for genomic sequence classification”. In: BMC Genomic Data 24.1 (2023), p. 25

[41]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”. In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=tEYskw1VY2

[42]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Re. “Efficiently Modeling Long Sequences with Structured State Spaces”. In: International Conference on Learning Representations. 2022. url: https://openreview.net/forum?id=uYLFoz1vlAC

[43]

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. “LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models”. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Ed. by Kevin Duh, Helen...

doi: 10.18653/v1/2024.naacl-long.222
[44]

Liquid Structural State-Space Models

Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. “Liquid Structural State-Space Models”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=g4OTKRKfS7R

[45]

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, and Rogerio Feris. “CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory”. In: arXiv preprint arXiv:2402.13449 (2024)

[46]

The organization of behavior: A neuropsychological theory

Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology press, 2005

[47]

Neural networks and physical systems with emergent collective computational abilities

John J Hopfield. “Neural networks and physical systems with emergent collective computational abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558

[48]

Multilayer feedforward networks are universal approximators

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators”. In: Neural networks 2.5 (1989), pp. 359–366

[49]

RULER: What’s the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In: First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=kIoBbc76Sy

[50]

Block-recurrent transformers

DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. “Block-recurrent transformers”. In: Advances in neural information processing systems 35 (2022), pp. 33248–33261

[51]

The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention

Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. “The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention”. In: International Conference on Machine Learning. PMLR. 2022, pp. 9639–9659

[52]

Going beyond linear transformers with recurrent fast weight programmers

Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. “Going beyond linear transformers with recurrent fast weight programmers”. In: Advances in neural information processing systems 34 (2021), pp. 7703–7717

[53]

Online domain adaptation of a pre-trained cascade of classifiers

Vidit Jain and Erik Learned-Miller. “Online domain adaptation of a pre-trained cascade of classifiers”. In: CVPR 2011. IEEE. 2011, pp. 577–584

[55]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023)

[56]

PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels

Praneeth Kacham, Vahab Mirrokni, and Peilin Zhong. “PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=ghYrfdJfjK

[57]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling laws for neural language models”. In: arXiv preprint arXiv:2001.08361 (2020)

[58]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. “Transformers are rnns: Fast autoregressive transformers with linear attention”. In: International conference on machine learning. PMLR. 2020, pp. 5156–5165

[59]

Generalization through Memorization: Nearest Neighbor Language Models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. “Generalization through Memorization: Nearest Neighbor Language Models”. In: International Conference on Learning Representations. 2020. url: https://openreview.net/forum?id=HklBjCEKvH

[60]

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack”. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024. url: https://openreview.net/forum?id=u7m2CG84BQ

[61]

Self-attentive associative memory

Hung Le, Truyen Tran, and Svetha Venkatesh. “Self-attentive associative memory”. In: International conference on machine learning. PMLR. 2020, pp. 5682–5691

[62]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. “Retrieval-augmented generation for knowledge-intensive nlp tasks”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474

[63]

Learning, Forgetting, Remembering: Insights From Tracking LLM Memorization During Training

Danny Leybzon and Corentin Kervadec. “Learning, Forgetting, Remembering: Insights From Tracking LLM Memorization During Training”. In: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024, pp. 43–57

[64]

Revisiting long-term time series forecasting: An investigation on linear mapping

Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. “Revisiting long-term time series forecasting: An investigation on linear mapping”. In: arXiv preprint arXiv:2305.10721 (2023)

[65]

Longhorn: State space models are amortized online learners

Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. “Longhorn: State space models are amortized online learners”. In: arXiv preprint arXiv:2407.14207 (2024)

[66]

Lost in the middle: How language models use long contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. “Lost in the middle: How language models use long contexts”. In: Transactions of the Association for Computational Linguistics 12 (2024), pp. 157–173

[67]

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. “itransformer: Inverted transformers are effective for time series forecasting”. In: arXiv preprint arXiv:2310.06625 (2023)

[68]

The structure of value: Accounting for taste

George Mandler. “The structure of value: Accounting for taste”. In: Affect and cognition. Psychology Press, 2014, pp. 3–36

[69]

Long Range Language Modeling via Gated State Spaces

Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. “Long Range Language Modeling via Gated State Spaces”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=5MkYIYCbva

[70]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. “Pointer Sentinel Mixture Models”. In: International Conference on Learning Representations. 2017. url: https://openreview.net/forum?id=Byj72udxe

[71]

The Illusion of State in State-Space Models

William Merrill, Jackson Petty, and Ashish Sabharwal. “The Illusion of State in State-Space Models”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=QZgo9JZpLq

[72]

Online model distillation for efficient video inference

Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. “Online model distillation for efficient video inference”. In: Proceedings of the IEEE/CVF International conference on computer vision. 2019, pp. 3573–3582

[73]

Leave no context behind: Efficient infinite context transformers with infini-attention

Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. “Leave no context behind: Efficient infinite context transformers with infini-attention”. In: arXiv preprint arXiv:2404.07143 (2024)

[74]

Metalearned neural memory

Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, and Adam Trischler. “Metalearned neural memory”. In: Advances in Neural Information Processing Systems 32 (2019)

[75]

Neural semantic encoders

Tsendsuren Munkhdalai and Hong Yu. “Neural semantic encoders”. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. Vol. 1. NIH Public Access. 2017, p. 397

[76]

Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution

Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. “Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution”. In: Advances in neural information processing systems 36 (2024)

[77]

On First-Order Meta-Learning Algorithms

A Nichol. “On first-order meta-learning algorithms”. In: arXiv preprint arXiv:1803.02999 (2018)

[78]

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: Long-term forecasting with transformers”. In: arXiv preprint arXiv:2211.14730 (2022)

[79]

Learning and memory

Hideyuki Okano, Tomoo Hirano, and Evan Balaban. “Learning and memory”. In: Proceedings of the National Academy of Sciences 97.23 (2000), pp. 12403–12404

[80]

Resurrecting recurrent neural networks for long sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. “Resurrecting recurrent neural networks for long sequences”. In: International Conference on Machine Learning. PMLR. 2023, pp. 26670–26698


Showing first 80 references.