Titans: Learning to Memorize at Test Time
Pith reviewed 2026-05-14 22:03 UTC · model grok-4.3
The pith
Titans combine attention with a learnable neural long-term memory to handle contexts over two million tokens more effectively than Transformers or linear recurrent models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Titans introduce a neural long-term memory module that learns to memorize historical context at test time. The module operates alongside attention: attention serves as short-term memory that accurately models dependencies within the current context, while the neural memory provides persistent long-term storage. The architecture supports fast, parallelizable training and fast inference, and three variants show how the memory can be incorporated. The resulting models outperform prior approaches on multiple tasks and scale to contexts exceeding 2M tokens with improved needle-in-haystack accuracy.
What carries the argument
The neural long-term memory module, which learns to store and retrieve relevant historical information to complement attention's focus on the current context.
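To make the idea of memorizing at test time concrete, here is a minimal sketch. It is not the paper's update rule, which this page does not reproduce. It assumes a linear associative memory that is trained online by gradient steps on a squared recall error, with optional momentum and decay terms. The class name `AssociativeMemory` and all hyperparameters are hypothetical.

```python
# Illustrative sketch only (assumption): a linear key -> value memory whose
# weights are updated by gradient descent while the sequence is being read.

def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a list vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class AssociativeMemory:
    def __init__(self, d_key, d_val, lr=0.25, momentum=0.0, decay=0.0):
        self.M = [[0.0] * d_key for _ in range(d_val)]  # memory weights
        self.S = [[0.0] * d_key for _ in range(d_val)]  # momentum buffer
        self.lr, self.momentum, self.decay = lr, momentum, decay

    def read(self, key):
        """Retrieve the value currently associated with `key`."""
        return matvec(self.M, key)

    def write(self, key, value):
        """One test-time step on loss = ||M k - v||^2; returns the loss
        measured before the update."""
        err = [p - v for p, v in zip(self.read(key), value)]
        for i, e in enumerate(err):
            for j, k_j in enumerate(key):
                grad = 2.0 * e * k_j  # d loss / d M[i][j]
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * grad
                self.M[i][j] = (1.0 - self.decay) * self.M[i][j] + self.S[i][j]
        return sum(e * e for e in err)


memory = AssociativeMemory(d_key=2, d_val=2)
losses = [memory.write([1.0, 0.0], [0.5, -0.5]) for _ in range(20)]
print(losses[0], losses[-1])  # recall error shrinks as the pair is rewritten
```

With `decay > 0` older associations fade, which is one simple way to trade retention against capacity; whether such a rule stays stable over millions of tokens is exactly the question raised under the load-bearing premise.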
If this is right
- Titans outperform Transformers and linear recurrent models on language modeling, common-sense reasoning, genomics, and time series tasks.
- The models scale effectively to context windows larger than 2 million tokens.
- Titans achieve higher accuracy in needle-in-haystack tasks at large context sizes compared to baselines.
- Training remains fast and parallelizable while inference stays fast due to the memory design.
Where Pith is reading between the lines
- The memory module could allow models to handle even longer sequences without increasing the attention window size during training.
- This approach might generalize to domains like video processing or scientific simulations that require retaining information over very long periods.
- Future work could explore making the memory module's capacity adaptive based on the task.
Load-bearing premise
The neural memory module can be trained to reliably store and retrieve relevant information from history without catastrophic forgetting or introducing new errors that cancel out the benefits.
What would settle it
A test where Titans show no improvement or worse performance than baselines on long-context needle-in-haystack tasks at scales over 2 million tokens, or exhibit clear signs of memory failure like forgetting key facts.
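A test of this kind can be sketched as a small harness. The version below is an assumption-laden toy, not the benchmark used in the paper: the filler vocabulary, the secret string, and both stand-in "models" (`full_context_model`, `truncated_model`) are hypothetical and exist only to show how accuracy is scored across needle depths.

```python
import random

FILLER_WORDS = ["alpha", "beta", "gamma", "delta", "epsilon"]
MARKER = "The secret code is "


def build_haystack(n_filler, needle, depth, seed=0):
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end)
    among `n_filler` random filler sentences."""
    rng = random.Random(seed)
    filler = [
        " ".join(rng.choice(FILLER_WORDS) for _ in range(8)) + "."
        for _ in range(n_filler)
    ]
    pos = int(depth * n_filler)
    return " ".join(filler[:pos] + [needle] + filler[pos:])


def needle_accuracy(model_fn, n_filler, depths, secret="7421"):
    """Fraction of depths at which the model's answer contains the secret."""
    needle = MARKER + secret + "."
    question = "What is the secret code?"
    hits = 0
    for i, depth in enumerate(depths):
        context = build_haystack(n_filler, needle, depth, seed=i)
        hits += int(secret in model_fn(context, question))
    return hits / len(depths)


def full_context_model(context, question):
    """Stand-in for a model that can see the whole context."""
    start = context.find(MARKER)
    if start < 0:
        return ""
    start += len(MARKER)
    return context[start:start + 4]


def truncated_model(context, question, window=200):
    """Stand-in for a model that only sees the last `window` characters."""
    return full_context_model(context[-window:], question)


depths = [0.0, 0.5, 1.0]
print(needle_accuracy(full_context_model, 100, depths))
print(needle_accuracy(truncated_model, 100, depths))
```

A real evaluation would replace the stand-ins with calls to the models under comparison and sweep the haystack length up to and beyond 2M tokens.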
Original abstract
Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Titans, a family of architectures that combines attention (as short-term memory) with a new neural long-term memory module that learns to memorize and retrieve historical context at test time. Three variants are presented for integrating the memory. Experiments on language modeling, common-sense reasoning, genomics, and time series show Titans outperforming Transformers and modern linear recurrent models, scaling effectively to contexts larger than 2M tokens, and achieving higher needle-in-haystack accuracy.
Significance. If the empirical results hold under rigorous controls, the work could meaningfully advance efficient long-context modeling by offering a persistent memory mechanism that avoids full quadratic attention while supporting fast inference and generalization beyond training lengths.
Major comments (3)
- [Experiments] Experiments section: the performance claims lack error bars, ablation studies isolating the neural memory module's contribution, and explicit reporting of training context lengths, which are required to substantiate the >2M scaling result in needle-in-haystack tasks.
- [Architecture] Architecture section: the test-time update rule for the neural long-term memory module is specified at a high level without equations or analysis demonstrating stability or resistance to catastrophic forgetting under unsupervised next-token prediction.
- [Needle-in-haystack Evaluation] Needle-in-haystack results: the superior accuracy for contexts >2M is presented without detailing baseline implementations, memory state initialization, or controls confirming generalization beyond training lengths.
Minor comments (2)
- [Abstract] Abstract: the claim of 'fast parallelizable training' would benefit from explicit complexity comparisons (e.g., O(N) vs. O(N^2)) to the cited linear recurrent baselines.
- [Introduction] Notation: the distinction between the neural memory hidden state and standard RNN states should be formalized with consistent symbols to improve clarity.
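The O(N) vs. O(N^2) contrast mentioned in the first minor comment can be made concrete with a back-of-the-envelope count. This sketch assumes the simplest cost model: full attention scores every token pair, while a recurrent model performs one state update per token. Hidden dimensions, constant factors, and kernel-level optimizations are deliberately ignored, and the function names are hypothetical.

```python
def full_attention_pairs(n_tokens):
    """Number of query-key scores in full (non-causal) attention."""
    return n_tokens * n_tokens


def recurrent_updates(n_tokens):
    """Number of state updates in a linear recurrent model."""
    return n_tokens


for n in (2_000, 2_000_000):
    pairs = full_attention_pairs(n)
    steps = recurrent_updates(n)
    print(f"N={n:>9,}  pairs={pairs:.1e}  updates={steps:.1e}  "
          f"ratio={pairs // steps:,}")
```

Under this cost model the gap between the two grows linearly with N, so at 2M tokens full attention does two million times more pairwise work than a recurrent pass.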
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.
Point-by-point responses
- Referee: [Experiments] Experiments section: the performance claims lack error bars, ablation studies isolating the neural memory module's contribution, and explicit reporting of training context lengths, which are required to substantiate the >2M scaling result in needle-in-haystack tasks.
  Authors: We agree that error bars, targeted ablations, and explicit training context lengths are necessary to strengthen the empirical claims. In the revised manuscript, we will add error bars computed over multiple random seeds for all reported metrics. We will include new ablation studies that isolate the contribution of the neural long-term memory module (e.g., Titans without the memory module vs. full Titans). We will also explicitly state the training context lengths used for each model and task to support the >2M scaling results.
  Revision: yes
- Referee: [Architecture] Architecture section: the test-time update rule for the neural long-term memory module is specified at a high level without equations or analysis demonstrating stability or resistance to catastrophic forgetting under unsupervised next-token prediction.
  Authors: We will expand the Architecture section to include the full mathematical formulation of the test-time update rule, including the precise equations governing the memory state evolution. We will add a dedicated subsection providing stability analysis (e.g., bounds on state norms) and empirical evaluations of resistance to catastrophic forgetting, including controlled experiments under unsupervised next-token prediction on long sequences.
  Revision: yes
- Referee: [Needle-in-haystack Evaluation] Needle-in-haystack results: the superior accuracy for contexts >2M is presented without detailing baseline implementations, memory state initialization, or controls confirming generalization beyond training lengths.
  Authors: We will revise the Needle-in-haystack Evaluation section to provide complete details on baseline implementations (including exact model variants and hyperparameters), memory state initialization procedures at test time, and explicit controls (e.g., training-length-matched vs. extended-context evaluations) that confirm generalization beyond the training context lengths.
  Revision: yes
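The error bars over multiple random seeds requested in the first response could be reported along the following lines. This is a minimal sketch using Python's standard `statistics` module; the helper names are hypothetical and the scores are made-up placeholders, not results from the paper.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation, standard error) for one
    metric measured over several random seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    sem = std / len(scores) ** 0.5
    return mean, std, sem


def format_result(scores):
    """Render a metric as 'mean ± std (n=seeds)' for a results table."""
    mean, std, _ = summarize(scores)
    return f"{mean:.3f} ± {std:.3f} (n={len(scores)})"


# Placeholder accuracies for three seeds (synthetic, for illustration only).
placeholder_scores = [0.90, 0.92, 0.94]
print(format_result(placeholder_scores))
```

Reporting the number of seeds next to the spread lets a reader judge whether a gap between a full model and its no-memory ablation is larger than seed-to-seed noise.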
Circularity Check
No significant circularity; claims rest on empirical validation of architecture
Full rationale
The paper introduces Titans as a family of architectures combining attention (short-term memory) with a new neural long-term memory module. Central claims concern empirical superiority on language modeling, reasoning, genomics, time-series tasks and scaling beyond 2M context in needle-in-haystack evaluations. No derivation chain, equations, or first-principles predictions appear that reduce to fitted parameters, self-definitions, or self-citation loops. Architecture choices are presented as design decisions tested experimentally rather than derived quantities that collapse to their inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard Transformer attention and recurrent hidden-state dynamics can be combined with an additional learned memory without introducing unmanageable training instability.
Invented entities (1)
- Neural long-term memory module (no independent evidence)
Forward citations
Cited by 20 Pith papers
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
  Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
  Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
  OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
- Cognifold: Always-On Proactive Memory via Cognitive Folding
  Cognifold is a new proactive memory architecture that folds event streams into emergent cognitive structures by extending complementary learning systems theory with a prefrontal intent layer and graph topology self-or...
- $\delta$-mem: Efficient Online Memory for Large Language Models
  δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
- A Single-Layer Model Can Do Language Modeling
  A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
- The Impossibility Triangle of Long-Context Modeling
  No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
- Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
  Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.
- Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
  LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...
- DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
  CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
  Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
  MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
- Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
  Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
- Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
  Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Gated Delta Networks: Improving Mamba2 with Delta Rule
  Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
- From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms
  LLM agent memory is organized into Storage (preserving trajectories), Reflection (refining them), and Experience (abstracting into reusable knowledge) stages driven by needs for long-range consistency, dynamic adaptat...
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. “Gpt-4 technical report”. In:arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Yaroslav Aksenov, Nikita Balagansky, Sofia Maria Lo Cicero Vaina, Boris Shaposhnikov, Alexey Gorbatovski, and Daniil Gavrilov. “Linear Transformers with Learnable Kernel Functions are Better In-Context Models”. In:arXiv preprint arXiv:2402.10644 (2024)
-
[3]
Learning to learn by gradient descent by gradient descent
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. “Learning to learn by gradient descent by gradient descent”. In:Advances in neural information processing systems 29 (2016)
work page 2016
-
[4]
Exploring length generalization in large language models
Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. “Exploring length generalization in large language models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 38546–38556
work page 2022
-
[5]
Simple linear attention language models balance the recall-throughput tradeoff
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, James Zou, Atri Rudra, and Christo- pher Re. “Simple linear attention language models balance the recall-throughput tradeoff”. In:Forty-first International Conference on Machine Learning . 2024. url: https://openreview.net/forum?id=e93ffDcpH3
work page 2024
-
[6]
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau. “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473 (2014)
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[7]
The Pitfalls of Memo- rization: When Memorization Hurts Generalization
Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, and Pascal Vincent. “The Pitfalls of Memo- rization: When Memorization Hurts Generalization”. In: arXiv preprint arXiv:2412.07684 (2024)
-
[8]
xLSTM: Extended Long Short-Term Memory
Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. “xLSTM: Extended Long Short-Term Memory”. In: arXiv preprint arXiv:2405.04517 (2024)
-
[9]
Mambamixer: Efficient selective state space models with dual token and channel selection
Ali Behrouz, Michele Santacatterina, and Ramin Zabih. “Mambamixer: Efficient selective state space models with dual token and channel selection”. In: arXiv preprint arXiv:2403.19888 (2024)
-
[10]
Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, and Gargi Gosh. “Memory Layers at Scale”. In: arXiv preprint arXiv:2412.09764 (2024)
-
[11]
Birth of a transformer: A memory viewpoint
Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. “Birth of a transformer: A memory viewpoint”. In: Advances in Neural Information Processing Systems 36 (2024)
work page 2024
-
[12]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. “Piqa: Reasoning about physical commonsense in natural language”. In: Proceedings of the AAAI conference on artificial intelligence . Vol. 34. 05. 2020, pp. 7432–7439
work page 2020
-
[13]
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al. “RecurrentGemma: Moving Past Transformers for Efficient Open Language Models”. In:arXiv preprint arXiv:2404.07839 (2024)
-
[14]
Léon Bottou and Vladimir Vapnik. “Local learning algorithms”. In: Neural computation 4.6 (1992), pp. 888–900
work page 1992
-
[15]
Scaling transformer to 1m tokens and beyond with rmt
Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, and Mikhail S Burtsev. “Scaling transformer to 1m tokens and beyond with rmt”. In: arXiv preprint arXiv:2304.11062 (2023)
-
[16]
Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. “Recurrent memory transformer”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 11079–11091
work page 2022
-
[17]
An Evolved Universal Transformer Memory
Edoardo Cetin, Qi Sun, Tianyu Zhao, and Yujin Tang. “An Evolved Universal Transformer Memory”. In: arXiv preprint arXiv:2410.13166 (2024)
-
[18]
Scatterbrain: Unifying sparse and low-rank attention
Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. “Scatterbrain: Unifying sparse and low-rank attention”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 17413–17426
work page 2021
-
[19]
Rethinking Attention with Performers
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. “Rethinking Attention with Performers”. In:International Conference on Learning Representations
-
[20]
url: https://openreview.net/forum?id=Ua6zuk0WRH
-
[21]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. “BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions”. In:Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...
-
[22]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. “Think you have solved question answering? try arc, the ai2 reasoning challenge”. In:arXiv preprint arXiv:1803.05457 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[23]
What are the differences between long-term, short-term, and working memory?
Nelson Cowan. “What are the differences between long-term, short-term, and working memory?” In:Progress in brain research 169 (2008), pp. 323–338
work page 2008
-
[24]
Transformer- XL: Attentive Language Models beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. “Transformer- XL: Attentive Language Models beyond a Fixed-Length Context”. In: ACL (1). Ed. by Anna Korhonen, David R. Traum, and Lluís Màrquez. Association for Computational Linguistics, 2019, pp. 2978–2988.isbn: 978-1-950737-48-2
work page 2019
-
[25]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning”. In:The Twelfth Inter- national Conference on Learning Representations . 2024. url: https://openreview.net/forum?id=mZn2Xyh9Ec
work page 2024
-
[26]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”. In:Advances in Neural Information Processing Systems . Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., 2022, pp. 16344–16359. url: https://proceedings.neu...
work page 2022
-
[27]
Tri Dao and Albert Gu. “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality”. In: arXiv preprint arXiv:2405.21060 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Long-term Forecasting with TiDE: Time-series Dense Encoder
Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. “Long-term Forecasting with TiDE: Time-series Dense Encoder”. In: Transactions on Machine Learning Research (2023). issn: 2835-8856. url: https://openreview.net/forum?id=pCbC3aQB5W
work page 2023
-
[29]
Griffin: Mixing gated linear recurrences with local attention for efficient language models
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. “Griffin: Mixing gated linear recurrences with local attention for efficient language models”. In:arXiv preprint arXiv:2402.19427 (2024)
-
[30]
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. “Flex Attention: A Programming Model for Generating Optimized Attention Kernels”. In: arXiv preprint arXiv:2412.05496 (2024)
-
[31]
Hymba: A Hybrid-head Architecture for Small Language Models
Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. “Hymba: A Hybrid-head Architecture for Small Language Models”. In: arXiv preprint arXiv:2411.13676 (2024)
-
[32]
Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. “Sigmoid-weighted linear units for neural network function approxi- mation in reinforcement learning”. In: Neural networks 107 (2018), pp. 3–11
work page 2018
-
[33]
Learn to remember: Transformer with recurrent memory for document-level machine translation
Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, and Philipp Koehn. “Learn to remember: Transformer with recurrent memory for document-level machine translation”. In: arXiv preprint arXiv:2205.01546 (2022)
-
[34]
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. “Hungry Hungry Hippos: Towards Language Modeling with State Space Models”. In:The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=COZDy0WYGg
work page 2023
-
[35]
Test-time training with masked autoencoders
Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei Efros. “Test-time training with masked autoencoders”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 29374–29385
work page 2022
-
[36]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. “The pile: An 800gb dataset of diverse text for language modeling”. In:arXiv preprint arXiv:2101.00027 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[37]
Learning to forget: Continual prediction with LSTM
Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM”. In: Neural computation 12.10 (2000), pp. 2451–2471
work page 2000
-
[38]
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines . 2014. arXiv: 1410.5401 [cs.NE] . url: https://arxiv.org/abs/1410.5401
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[39]
Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. “LSTM: A search space odyssey”. In: IEEE transactions on neural networks and learning systems 28.10 (2016), pp. 2222–2232
work page 2016
-
[40]
Genomic benchmarks: a collection of datasets for genomic sequence classification
Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, and Panagiotis Alexiou. “Genomic benchmarks: a collection of datasets for genomic sequence classification”. In: BMC Genomic Data 24.1 (2023), p. 25
work page 2023
-
[41]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces”. In:First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=tEYskw1VY2
work page 2024
-
[42]
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, and Christopher Re. “Efficiently Modeling Long Sequences with Structured State Spaces”. In: International Conference on Learning Representations . 2022. url: https : / / openreview . net / forum ? id = uYLFoz1vlAC. 19
work page 2022
-
[43]
LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. “LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models”. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Ed. by Kevin Duh, Helen...
-
[44]
Liquid Structural State-Space Models
Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. “Liquid Structural State-Space Models”. In: The Eleventh International Conference on Learning Representations . 2023. url: https://openreview.net/forum?id=g4OTKRKfS7R
work page 2023
-
[45]
CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory
Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, and Rogerio Feris. “CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory”. In: arXiv preprint arXiv:2402.13449 (2024)
-
[46]
The organization of behavior: A neuropsychological theory
Donald Olding Hebb. The organization of behavior: A neuropsychological theory . Psychology press, 2005
work page 2005
-
[47]
Neural networks and physical systems with emergent collective computational abilities
John J Hopfield. “Neural networks and physical systems with emergent collective computational abilities.” In: Proceedings of the national academy of sciences 79.8 (1982), pp. 2554–2558
work page 1982
-
[48]
Multilayer feedforward networks are universal approxi- mators
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approxi- mators”. In: Neural networks 2.5 (1989), pp. 359–366
work page 1989
-
[49]
RULER: What’s the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” In:First Conference on Language Modeling. 2024. url: https://openreview.net/forum?id=kIoBbc76Sy
work page 2024
-
[50]
DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. “Block-recurrent transformers”. In: Advances in neural information processing systems 35 (2022), pp. 33248–33261
work page 2022
-
[51]
Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. “The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention”. In: International Conference on Machine Learning . PMLR. 2022, pp. 9639–9659
work page 2022
-
[52]
Going beyond linear transformers with recurrent fast weight programmers
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. “Going beyond linear transformers with recurrent fast weight programmers”. In: Advances in neural information processing systems 34 (2021), pp. 7703–7717
work page 2021
-
[53]
Online domain adaptation of a pre-trained cascade of classifiers
Vidit Jain and Erik Learned-Miller. “Online domain adaptation of a pre-trained cascade of classifiers”. In:CVPR
- [54]
-
[55]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. “Mistral 7B”. In:arXiv preprint arXiv:2310.06825 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
PolySketchFormer: Fast Transformers via Sketching Polyno- mial Kernels
Praneeth Kacham, Vahab Mirrokni, and Peilin Zhong. “PolySketchFormer: Fast Transformers via Sketching Polyno- mial Kernels”. In: Forty-first International Conference on Machine Learning . 2024. url: https://openreview.net/ forum?id=ghYrfdJfjK
work page 2024
-
[57]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. “Scaling laws for neural language models”. In:arXiv preprint arXiv:2001.08361 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[58]
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. “Transformers are rnns: Fast autoregressive transformers with linear attention”. In: International conference on machine learning. PMLR. 2020, pp. 5156–5165
-
[59]
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. “Generalization through Memorization: Nearest Neighbor Language Models”. In: International Conference on Learning Representations. 2020. url: https://openreview.net/forum?id=HklBjCEKvH
-
[60]
Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack”. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2024. url: https://openreview.net/forum?id=u7m2CG84BQ
-
[61]
Hung Le, Truyen Tran, and Svetha Venkatesh. “Self-attentive associative memory”. In: International conference on machine learning. PMLR. 2020, pp. 5682–5691
-
[62]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. “Retrieval-augmented generation for knowledge-intensive nlp tasks”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 9459–9474
-
[63]
Danny Leybzon and Corentin Kervadec. “Learning, Forgetting, Remembering: Insights From Tracking LLM Memorization During Training”. In: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024, pp. 43–57
-
[64]
Zhe Li, Shiyi Qi, Yiduo Li, and Zenglin Xu. “Revisiting long-term time series forecasting: An investigation on linear mapping”. In: arXiv preprint arXiv:2305.10721 (2023)
-
[65]
Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. “Longhorn: State space models are amortized online learners”. In: arXiv preprint arXiv:2407.14207 (2024)
-
[66]
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. “Lost in the middle: How language models use long contexts”. In: Transactions of the Association for Computational Linguistics 12 (2024), pp. 157–173
-
[67]
Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. “itransformer: Inverted transformers are effective for time series forecasting”. In: arXiv preprint arXiv:2310.06625 (2023)
-
[68]
George Mandler. “The structure of value: Accounting for taste”. In: Affect and cognition. Psychology Press, 2014, pp. 3–36
-
[69]
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. “Long Range Language Modeling via Gated State Spaces”. In: The Eleventh International Conference on Learning Representations. 2023. url: https://openreview.net/forum?id=5MkYIYCbva
-
[70]
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. “Pointer Sentinel Mixture Models”. In: International Conference on Learning Representations. 2017. url: https://openreview.net/forum?id=Byj72udxe
-
[71]
William Merrill, Jackson Petty, and Ashish Sabharwal. “The Illusion of State in State-Space Models”. In: Forty-first International Conference on Machine Learning. 2024. url: https://openreview.net/forum?id=QZgo9JZpLq
-
[72]
Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. “Online model distillation for efficient video inference”. In: Proceedings of the IEEE/CVF International conference on computer vision. 2019, pp. 3573–3582
-
[73]
Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. “Leave no context behind: Efficient infinite context transformers with infini-attention”. In: arXiv preprint arXiv:2404.07143 (2024)
-
[74]
Tsendsuren Munkhdalai, Alessandro Sordoni, Tong Wang, and Adam Trischler. “Metalearned neural memory”. In: Advances in Neural Information Processing Systems 32 (2019)
-
[75]
Tsendsuren Munkhdalai and Hong Yu. “Neural semantic encoders”. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. Vol. 1. NIH Public Access. 2017, p. 397
-
[76]
Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. “Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution”. In: Advances in neural information processing systems 36 (2024)
-
[77]
A Nichol. “On first-order meta-learning algorithms”. In: arXiv preprint arXiv:1803.02999 (2018)
-
[78]
Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: Long-term forecasting with transformers”. In: arXiv preprint arXiv:2211.14730 (2022)
-
[79]
Hideyuki Okano, Tomoo Hirano, and Evan Balaban. “Learning and memory”. In: Proceedings of the National Academy of Sciences 97.23 (2000), pp. 12403–12404
-
[80]
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. “Resurrecting recurrent neural networks for long sequences”. In: International Conference on Machine Learning. PMLR. 2023, pp. 26670–26698