pith. machine review for the scientific record. sign in

arxiv: 2407.04620 · v4 · submitted 2024-07-05 · 💻 cs.LG · cs.AI· cs.CL

Recognition: 2 theorem links

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords test-time trainingRNNlong contextsequence modelinglinear complexityhidden stateself-supervised update
0
0 comments X

The pith

RNNs can match long-context performance by updating a learnable hidden-state model via self-supervised steps at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build sequence layers that combine linear complexity with high expressive power. The hidden state is no longer a fixed vector but a small machine learning model whose parameters are adjusted by gradient descent on the incoming test sequence itself. This lets the layer adapt its representation to the specific data seen so far, so that perplexity keeps falling as more tokens arrive. A reader would care because the approach avoids both the quadratic cost of attention and the performance saturation of conventional RNNs after roughly 16k tokens.

Core claim

TTT layers instantiate the hidden state as a trainable model and replace the usual recurrence with a step of self-supervised learning performed on the test sequence. For the two concrete cases examined, TTT-Linear uses a linear model and TTT-MLP uses a two-layer network; both keep lowering perplexity when conditioned on longer contexts, while a strong Mamba baseline plateaus after 16k tokens. The evaluation covers models from 125M to 1.3B parameters and directly compares against a Transformer baseline.

What carries the argument

The TTT layer, whose hidden state is itself a small model updated by one or more gradient steps of self-supervised learning on the current test sequence.

If this is right

  • Linear-complexity layers can continue to benefit from additional context beyond the point where fixed-state RNNs saturate.
  • The same architecture family can be scaled from 125M to over a billion parameters while preserving the long-context scaling behavior.
  • Memory and compute trade-offs shift from attention's quadratic growth to the cost of storing and updating the internal model parameters.
  • Future layer designs can focus on improving the I/O efficiency of the gradient steps without changing the core recurrence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic adaptation of the hidden state could reduce reliance on extremely long fixed context windows if the model learns useful patterns from recent tokens alone.
  • The same mechanism might be applied to online settings where new data arrives continuously and the model must improve without a separate training phase.
  • If the internal model can be made lighter, TTT layers could serve as drop-in replacements for attention in resource-constrained inference environments.

Load-bearing premise

Gradient-based self-supervised updates performed on the hidden-state model during inference stay stable, cheap enough to run, and do not overfit or degrade the output.

What would settle it

A controlled run in which TTT-Linear or TTT-MLP stops improving perplexity after 16k tokens or begins to produce unstable outputs when the test-time updates are enabled.

read the original abstract

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Test-Time Training (TTT) layers as a framework for sequence modeling with linear complexity but expressive hidden states. The hidden state is instantiated as a learnable model (linear regressor or 2-layer MLP) whose parameters are updated via self-supervised gradient steps on the input sequence at test time. Two variants, TTT-Linear and TTT-MLP, are evaluated at 125M–1.3B parameter scales against a strong Transformer baseline and Mamba; the key empirical result is that TTT models continue to reduce perplexity as context grows beyond 16k tokens while Mamba plateaus.

Significance. If the central empirical claim holds, the work supplies a concrete route to linear-complexity models whose hidden states adapt via test-time learning, yielding continued gains on long contexts where standard RNNs saturate. The scaling experiments to 1.3B parameters and direct head-to-head comparisons with Mamba and Transformer constitute reproducible empirical evidence that strengthens the case for test-time adaptation as a viable direction.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments): The claim that TTT-Linear/MLP continue reducing perplexity with >16k tokens while Mamba plateaus depends on the hidden-state model receiving stable, beneficial self-supervised gradient updates at inference. The section reports final perplexity numbers but provides no analysis of update stability (gradient norms, per-step loss trajectories, or divergence checks) or sensitivity to the number of gradient steps and learning-rate schedule used during test-time training. This is load-bearing for the scaling advantage.
  2. [§3 (Method)] §3 (Method): The update rule for the hidden-state parameters (linear or MLP) is defined as a self-supervised step, yet the manuscript does not specify the exact optimizer, step count per token/segment, or regularization used at test time. Without these details it is impossible to assess whether the reported linear-complexity advantage remains tractable and non-overfitting at 1.3B scale.
minor comments (2)
  1. [Abstract] Abstract and §4: The phrase 'memory I/O issues for TTT-MLP' is stated without any quantitative breakdown (e.g., peak memory vs. context length or wall-clock overhead relative to Mamba). Adding a short table or plot would clarify the practical limitation.
  2. [§3 (Method)] Notation in §3: The symbols for the hidden-state model parameters and the self-supervised loss are introduced without an explicit table of definitions, making cross-references to the update equations harder to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of TTT layers for long-context scaling. We address each major comment below and will incorporate the requested details and analyses into the revised manuscript.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The claim that TTT-Linear/MLP continue reducing perplexity with >16k tokens while Mamba plateaus depends on the hidden-state model receiving stable, beneficial self-supervised gradient updates at inference. The section reports final perplexity numbers but provides no analysis of update stability (gradient norms, per-step loss trajectories, or divergence checks) or sensitivity to the number of gradient steps and learning-rate schedule used during test-time training. This is load-bearing for the scaling advantage.

    Authors: We agree that stability analysis is necessary to support the central empirical claim. In the revised version we will add to §4 new figures and text reporting (i) gradient-norm trajectories during test-time updates on long sequences, (ii) per-step self-supervised loss curves on held-out segments, (iii) explicit checks for divergence or instability, and (iv) ablation tables showing sensitivity of final perplexity to the number of gradient steps and the learning-rate schedule used at test time. These additions will directly substantiate that the observed scaling advantage arises from stable, beneficial updates. revision: yes

  2. Referee: [§3 (Method)] The update rule for the hidden-state parameters (linear or MLP) is defined as a self-supervised step, yet the manuscript does not specify the exact optimizer, step count per token/segment, or regularization used at test time. Without these details it is impossible to assess whether the reported linear-complexity advantage remains tractable and non-overfitting at 1.3B scale.

    Authors: We acknowledge the omission of precise test-time hyperparameters. The revised §3 will explicitly state the optimizer (Adam with β1=0.9, β2=0.999), the exact number of gradient steps performed per token or per segment, the learning-rate value and any decay schedule, and the regularization applied (weight decay of 0.01 together with gradient clipping at norm 1.0). These details will be provided for both TTT-Linear and TTT-MLP so that readers can verify tractability and reproducibility at the 1.3B scale. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural proposal with direct empirical validation

full rationale

The paper defines TTT layers by making the hidden state itself a learnable model (linear or 2-layer MLP) whose parameters are updated via a self-supervised gradient step on each test token or segment. This is an explicit architectural choice, not a mathematical derivation that reduces to prior equations or fitted inputs. No load-bearing self-citations, uniqueness theorems from the same authors, or ansatzes smuggled via prior work appear in the core construction. The central scaling claim (TTT continues reducing perplexity beyond 16k tokens while Mamba plateaus) rests on direct experimental comparisons at 125M–1.3B scale rather than any reduction of outputs to inputs by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that self-supervised test-time updates can meaningfully increase hidden-state expressiveness in linear-complexity layers; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption Self-supervised gradient updates on a small model serving as hidden state improve expressiveness without instability at test time
    This is the load-bearing premise that allows linear complexity to coexist with high capacity.
invented entities (1)
  • TTT layer no independent evidence
    purpose: Sequence modeling layer whose hidden state is a trainable model updated at test time
    New architectural primitive introduced to overcome limited expressiveness of standard RNN states.

pith-pipeline@v0.9.0 · 5551 in / 1236 out tokens · 50554 ms · 2026-05-15T05:15:14.448221+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  3. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  4. Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

    cs.CV 2026-04 unverdicted novelty 7.0

    Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

  5. OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

    cs.LG 2026-05 unverdicted novelty 6.0

    OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.

  6. A Single-Layer Model Can Do Language Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

  7. Linearizing Vision Transformer with Test-Time Training

    cs.CV 2026-05 unverdicted novelty 6.0

    Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...

  8. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and a...

  9. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  10. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  11. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  12. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  13. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  14. Titans: Learning to Memorize at Test Time

    cs.LG 2024-12 unverdicted novelty 6.0

    Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

  15. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  16. Cortico-cerebellar modularity as an architectural inductive bias for efficient temporal learning

    q-bio.NC 2026-05 unverdicted novelty 5.0

    CB-RNNs with a cerebellar feedforward module learn temporal tasks faster than matched RNNs, with the module driving efficiency even after freezing the recurrent core as a fixed reservoir.

  17. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  18. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  19. Measuring Accuracy and Energy-to-Solution of Quantum Fine-Tuning of Foundational AI Models

    quant-ph 2026-05 conditional novelty 5.0

    Trapped-ion quantum fine-tuning of AI models shows linear energy scaling and 24% better classification error than classical logistic regression or SVM baselines, with a projected energy break-even at 34 qubits.

  20. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference

    cs.DC 2026-03 unverdicted novelty 5.0

    Unifying LLM memory optimizations into a Prepare-Compute-Retrieve-Apply pipeline and accelerating it on GPU-FPGA hardware yields up to 2.2x faster inference and 4.7x less energy than GPU-only baselines.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 19 Pith papers · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Learning to learn by gradient descent by gradient descent

    Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016

  3. [3]

    You just found out your book was used to train ai

    Authors Guild. You just found out your book was used to train ai. now what?, 2023. Accessed: 2024-06-24

  4. [4]

    xlstm: Ex- tended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Ex- tended long short-term memory. arXiv preprint arXiv:2405.04517, 2024

  5. [5]

    Learning a synaptic learning rule

    Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Citeseer, 1990

  6. [6]

    The nadaraya-watson kernel regression function estimator

    Hermanus Josephus Bierens. The nadaraya-watson kernel regression function estimator. (Serie Research Memoranda; No. 1988-58). Faculty of Economics and Business Administration, Vrije Universiteit Amsterdam., 1988

  7. [7]

    Pattern recognition and machine learning , volume 4

    Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning , volume 4. Springer, 2006

  8. [8]

    Gpt-neox-20b: An open-source autoregressive language model

    Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022

  9. [9]

    Local learning algorithms.Neural computation, 4(6):888–900, 1992

    Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992

  10. [10]

    Variable kernel estimates of multivariate densities

    Leo Breiman, William Meisel, and Edward Purcell. Variable kernel estimates of multivariate densities. Technometrics, 19(2):135–144, 1977

  11. [11]

    Weighted nadaraya–watson regression estimation

    Zongwu Cai. Weighted nadaraya–watson regression estimation. Statistics & probability letters, 51(3):307–318, 2001

  12. [12]

    Training deep nets with sublinear memory cost, 2016

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost, 2016

  13. [13]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020

  14. [14]

    A tutorial on kernel density estimation and recent advances

    Yen-Chi Chen. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017

  15. [15]

    Meta-learning fast weight language models

    Kevin Clark, Kelvin Guu, Ming-Wei Chang, Panupong Pasupat, Geoffrey Hinton, and Moham- mad Norouzi. Meta-learning fast weight language models. arXiv preprint arXiv:2212.02475, 2022

  16. [16]

    Large scale transductive svms

    Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou, and Thorsten Joachims. Large scale transductive svms. Journal of Machine Learning Research, 7(8), 2006

  17. [17]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024. 20

  18. [18]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for e fficient language models. arXiv preprint arXiv:2402.19427, 2024

  19. [19]

    In the long (context) run, 2023

    Harm de Vries. In the long (context) run, 2023. Accessed: 2024-06-24

  20. [20]

    Dynamic connections in neural networks.Biological cybernetics, 46(1):27–39, 1982

    Jerome A Feldman. Dynamic connections in neural networks.Biological cybernetics, 46(1):27–39, 1982

  21. [21]

    Model-agnostic meta-learning for fast adapta- tion of deep networks

    Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta- tion of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017

  22. [22]

    Gammerman, V

    A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998

  23. [23]

    Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 2022

  24. [24]

    The pile: An 800gb dataset of diverse text for language modeling, 2020

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020

  25. [25]

    EasyLM: A Simple And Scalable Training Framework for Large Language Models

    Xinyang Geng. EasyLM: A Simple And Scalable Training Framework for Large Language Models. https://github.com/young-geng/EasyLM, mar 2023. https://github.com/ young-geng/EasyLM

  26. [26]

    Unlocking state-tracking in linear rnns through negative eigenvalues

    Riccardo Grazzi, Julien Siems, Arber Zela, Jörg KH Franke, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear rnns through negative eigenvalues. International Conference on Learning Representations (ICLR), 2024

  27. [27]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  28. [28]

    Self-supervised policy adaptation during deployment

    Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309, 2020

  29. [29]

    Test-time training on nearest neighbors for large language models

    Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models. arXiv preprint arXiv:2305.18466, 2023

  30. [30]

    predictable

    Horace He. Strangely, matrix multiplications on gpus run faster when given "predictable" data! [short], 2024. Accessed: 2024-06-30

  31. [31]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  32. [32]

    Using fast weights to deblur old memories

    Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. InProceedings of the ninth annual conference of the Cognitive Science Society, pages 177–186, 1987

  33. [33]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  34. [34]

    Rae, Oriol Vinyals, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  35. [35]

    The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention

    Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention. In International Conference on Machine Learning, pages 9639–9659. PMLR, 2022

  36. [36]

    Practical computational power of linear transformers and their recurrent and self-referential extensions

    Kazuki Irie, Róbert Csordás, and Jürgen Schmidhuber. Practical computational power of linear transformers and their recurrent and self-referential extensions. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  37. [37]

    Neural di fferential equations for learning to program neural nets through continuous learning rules

    Kazuki Irie, Francesco Faccio, and Jürgen Schmidhuber. Neural di fferential equations for learning to program neural nets through continuous learning rules. Advances in Neural Information Processing Systems, 35:38614–38628, 2022

  38. [38]

    Going beyond linear transformers with recurrent fast weight programmers.Advances in Neural Information Processing Systems, 34:7703–7717, 2021

    Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. Going beyond linear transformers with recurrent fast weight programmers.Advances in Neural Information Processing Systems, 34:7703–7717, 2021

  39. [39]

    A modern self-referential weight matrix that learns to modify itself

    Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber. A modern self-referential weight matrix that learns to modify itself. In International Conference on Machine Learning , pages 9660–9677. PMLR, 2022

  40. [40]

    Images as weight matrices: Sequential image generation through synaptic learning rules

    Kazuki Irie and Jürgen Schmidhuber. Images as weight matrices: Sequential image generation through synaptic learning rules. International Conference on Learning Representations (ICLR), 2022

  41. [41]

    Online domain adaptation of a pre-trained cascade of classifiers

    Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011

  42. [42]

    Learning to classify text using support vector machines, volume 668

    Thorsten Joachims. Learning to classify text using support vector machines, volume 668. Springer Science & Business Media, 2002

  43. [43]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  44. [44]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  45. [45]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  46. [46]

    Meta learning backpropagation and improving it

    Louis Kirsch and Jürgen Schmidhuber. Meta learning backpropagation and improving it. Advances in Neural Information Processing Systems, 34:14122–14134, 2021

  47. [47]

    Dynamic evaluation of neural sequence models

    Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. In International Conference on Machine Learning, pages 2766–2775. PMLR, 2018

  48. [48]

    Dynamic Evaluation of Transformer Language Models

    Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378, 2019

  49. [49]

    E fficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. E fficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  50. [50]

    Building machines that learn and think like people

    Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40:e253, 2017. 22

  51. [51]

    Building high-level features using large scale unsupervised learning

    Quoc V Le. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8595–8598. IEEE, 2013

  52. [52]

    World model on million-length video and language with blockwise ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268, 2024

  53. [53]

    Consistent video depth estimation

    Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4):71–1, 2020

  54. [54]

    Gradient-based hyperparameter optimization through reversible learning

    Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113–2122. PMLR, 2015

  55. [55]

    Meta-Learning Update Rules for Unsupervised Representation Learning

    Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. arXiv preprint arXiv:1804.00222, 2018

  56. [56]

    Online model distillation for efficient video inference

    Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference. arXiv preprint arXiv:1812.02699, 2018

  57. [57]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P . Pret- tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  58. [58]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023

  59. [59]

    Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence

    Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Teddy Ferdinan, Haowen Hou, Przemysław Kazienko, et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024

  60. [60]

    The devil in linear transformer

    Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022

  61. [61]

    The perceptron: a probabilistic model for information storage and organiza- tion in the brain

    Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organiza- tion in the brain. Psychological review, 65(6):386, 1958

  62. [62]

    Linear transformers are secretly fast weight programmers

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021

  63. [63]

    Learning associative inference using fast weight memory

    Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning associative inference using fast weight memory. arXiv preprint arXiv:2011.07831, 2020

  64. [64]

    Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-

    Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987

  65. [65]

    Learning to control fast-weight memories: An alternative to dynamic recurrent networks

    Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

  66. [66]

    Glu variants improve transformer, 2020

    Noam Shazeer. Glu variants improve transformer, 2020

  67. [67]

    Normformer: Improved transformer pretraining with extra normalization

    Sam Shleifer, Jason Weston, and Myle Ott. Normformer: Improved transformer pretraining with extra normalization. arXiv preprint arXiv:2110.09456, 2021. 23

  68. [68]

    zero-shot

    Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep inter- nal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3118–3126, 2018

  69. [69]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023

  70. [70]

    Learning to (learn at test time)

    Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time). arXiv preprint arXiv:2310.13807, 2023

  71. [71]

    Online learning of unknown dynamics for model-based controllers in legged locomotion

    Yu Sun, Wyatt L Ubellacker, Wen-Loong Ma, Xiang Zhang, Changhao Wang, Noel V Csomay- Shanklin, Masayoshi Tomizuka, Koushil Sreenath, and Aaron D Ames. Online learning of unknown dynamics for model-based controllers in legged locomotion. IEEE Robotics and Automation Letters, 6(4):8442–8449, 2021

  72. [72]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020

  73. [73]

    Learning to learn: Introduction and overview

    Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998

  74. [74]

    Using fast weights to improve persistent contrastive divergence

    Tijmen Tieleman and Geoffrey Hinton. Using fast weights to improve persistent contrastive divergence. In Proceedings of the 26th annual international conference on machine learning, pages 1033–1040, 2009

  75. [75]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  76. [76]

    The nature of statistical learning theory

    Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013

  77. [77]

    Extracting and composing robust features with denoising autoencoders

    Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, page 1096–1103, 2008

  78. [78]

    The correlation theory of brain function

    Christoph Von Der Malsburg. The correlation theory of brain function. In Models of neural networks: Temporal aspects of coding and information processing in biological systems , pages 95–119. Springer, 1994

  79. [79]

    Test-time training on video streams

    Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A Efros, and Xiaolong Wang. Test-time training on video streams. arXiv preprint arXiv:2307.05014, 2023

  80. [80]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 24

Showing first 80 references.