Memory by Design: Probabilistic Sequence Layers

Cristina Savin; Hyungju Jeon; Il Memming Park; Matthew Dowling

arXiv: 2605.31163 · v1 · pith:5PDMMS7Tnew · submitted 2026-05-29 · stat.ML · cs.LG

Pith reviewed 2026-06-28 21:05 UTC · model grok-4.3

keywords: Bayesian filtering, sequence models, linear attention, state space models, memory mechanisms, DeltaNet, associative recall, long-context retrieval

The pith

A design-model framework built on exact Bayesian filtering derives and unifies linear attention, GLA, Mamba-2, and DeltaNet, and shows that covariance tracking improves retrieval robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a design-model framework that derives efficient recurrent sequence layers from explicit assumptions about memory, where evidence is written by exact Bayesian filtering and read out as a predictive distribution. In the linear-Gaussian case this yields Bayesian Layers that propagate both a mean and a covariance to track uncertainty over stored associations. The same framework recovers linear attention, GLA, and Mamba-2/SSD as exact filters under one design model and DeltaNet-style models as covariance-reset reductions under another. Restoring the covariance supplies closed-form retrieval predictions that are verified empirically and shown to increase robustness outside the training distribution on collision studies, associative recall tasks, and the MQAR benchmark; distilling the resulting layers into a 340M Gated DeltaNet improves long-context retrieval on RULER at matched compute.

Core claim

Several sub-quadratic recurrent architectures arise as instances of a design model in which memory writes are performed by exact Bayesian filtering; linear attention, GLA, and Mamba-2/SSD are exact filters under one such model, while DeltaNet and related Delta-rule models are covariance-reset reductions under another. Retaining the covariance yields closed-form predictions for retrieval dynamics that improve robustness beyond the training regime.

What carries the argument

The design-model framework, which specifies memory writes via exact Bayesian filtering and query-dependent readout; its linear-Gaussian instantiation is the Bayesian Layer that maintains and propagates both mean and covariance.
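To make the mean-and-covariance bookkeeping concrete, here is a minimal NumPy sketch of one write and one read for a linear-Gaussian memory. It is a textbook Kalman update under assumptions that are not taken from the paper: the memory is a matrix `M` observed through `v = M k + noise`, all rows of `M` share one key-side covariance `P`, and the noise scales `r` and `q` are illustrative. The function names are hypothetical.

```python
import numpy as np

def bayesian_layer_write(M, P, k, v, r=1e-6, q=0.0):
    """One write into a linear-Gaussian memory (standard Kalman update).

    M: (d_v, d_k) posterior mean of the memory (assumed parametrization).
    P: (d_k, d_k) key-side covariance shared by all rows of M (assumption).
    r: observation-noise variance; q: process-noise variance (illustrative).
    """
    P_bar = P + q * np.eye(P.shape[0])          # predict: inflate uncertainty
    u = P_bar @ k                               # effective write direction
    beta = 1.0 / (k @ u + r)                    # inverse innovation variance
    M_new = M + beta * np.outer(v - M @ k, u)   # mean moves along u, not raw k
    P_new = P_bar - beta * np.outer(u, u)       # uncertainty contracts along u
    return M_new, P_new

def bayesian_layer_read(M, P, query, r=1e-6):
    """Predictive mean and per-coordinate predictive variance at a query."""
    return M @ query, query @ P @ query + r

d_k, d_v = 4, 3
M0, P0 = np.zeros((d_v, d_k)), np.eye(d_k)
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.5, -1.0, 2.0])

M1, P1 = bayesian_layer_write(M0, P0, k, v)
y, var_written = bayesian_layer_read(M1, P1, k)
_, var_unseen = bayesian_layer_read(M1, P1, np.array([0.0, 1.0, 0.0, 0.0]))
```

After one low-noise write, reading at the written key returns approximately `v` with a small predictive variance, while an orthogonal query keeps its prior variance.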

If this is right

Linear attention, GLA, and Mamba-2/SSD are recovered as exact Bayesian filters under one design model.
DeltaNet and Delta-rule models emerge as covariance-reset reductions under a second design model.
Restoring covariance produces closed-form predictions for retrieval dynamics.
Covariance-aware layers improve robustness on collision studies, associative recall, and MQAR beyond the training regime.
Distilling Bayesian Layers into a pretrained Gated DeltaNet raises RULER long-context retrieval at matched compute.
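The covariance-reset reduction in the list above can be checked numerically in a toy setting. The sketch below assumes the reset sets the predicted covariance to the identity before every write (the paper's actual reset may differ) and uses the standard DeltaNet-style update `M (I - beta k k^T) + beta v k^T`. Under that assumption the Kalman mean update and the Delta rule coincide when `beta = 1 / (k^T k + r)`. All function names are hypothetical.

```python
import numpy as np

def kalman_mean_write(M, P_bar, k, v, r):
    """Mean update of a linear-Gaussian filter given predicted covariance P_bar."""
    u = P_bar @ k
    beta = 1.0 / (k @ u + r)
    return M + beta * np.outer(v - M @ k, u)

def delta_rule_write(M, k, v, beta):
    """DeltaNet-style write: erase along k, then add the new association."""
    eye = np.eye(k.size)
    return M @ (eye - beta * np.outer(k, k)) + beta * np.outer(v, k)

rng = np.random.default_rng(0)
d_k, d_v, r = 4, 3, 0.5
M = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)

# Assumed reset: predicted covariance is the identity at every step.
M_kalman = kalman_mean_write(M, np.eye(d_k), k, v, r)
M_delta = delta_rule_write(M, k, v, beta=1.0 / (k @ k + r))
max_gap = float(np.max(np.abs(M_kalman - M_delta)))
```

With any non-identity `P_bar` the two updates differ, because the Kalman write lands along `P_bar @ k` rather than along the raw key.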

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The unification may allow systematic derivation of new hybrid layers that combine exact filtering steps with selective resets.
Covariance tracking could serve as a diagnostic for when sequence models will generalize on out-of-distribution retrieval tasks.
The design-model approach might extend naturally to non-Gaussian or non-linear memory dynamics for more structured sequence problems.

Load-bearing premise

The linear-Gaussian instantiation accurately captures the memory dynamics of the target sequence layers and the design-model assumptions hold for the unified models without additional unstated constraints.

What would settle it

An experiment in which restoring covariance propagation fails to improve robustness on controlled collision studies, learned associative recall, or the Zoology MQAR benchmark, or in which the listed architectures deviate from the exact Bayesian filtering predictions of the design model.

Figures

Figures reproduced from arXiv: 2605.31163 by Cristina Savin, Hyungju Jeon, Il Memming Park, Matthew Dowling.

**Figure 1.** Design-model framework. (a) An input selects a design model; its filter updates the belief over memory, and the readout produces the layer output. (b) Propagating dense covariance gives the Bayesian Layer; covariance reset gives the Delta-rule family; a latent-input model gives the additive family. propagation: carrying uncertainty forward so future writes reflect past evidence. We evaluate this mechanism in contr…

**Figure 2.** Covariance as write geometry. A raw key triggers the write, but covariance decides where it lands: resolved components shrink, residual uncertainty receives the update. (a) The predicted covariance maps $k_t$ to the effective write direction $u_t = \bar{P}_t k_t$. (b) Repeated writes contract uncertainty along the written address, reducing future gain there. (c) After B has been resolved, a colliding A write rotates aw…
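The geometry in panels (a) and (b) can be reproduced in two dimensions. This is an illustrative sketch, not the paper's experiment: it assumes unit observation noise, no process noise, an identity prior covariance, and an overlap of 0.92 between the two keys; the helper name is hypothetical.

```python
import numpy as np

def write_step(P_bar, k, r=1.0):
    """Return the write direction, scalar gain, and updated covariance."""
    u = P_bar @ k                       # effective write direction
    beta = 1.0 / (k @ u + r)
    gain = beta * (k @ u)               # scalar write gain
    return u, gain, P_bar - beta * np.outer(u, u)

rho = 0.92                              # assumed key overlap
k_B = np.array([1.0, 0.0])
k_A = np.array([rho, np.sqrt(1.0 - rho**2)])

P = np.eye(2)
gains_B = []
for _ in range(5):                      # resolve address B with repeated writes
    _, g, P = write_step(P, k_B)
    gains_B.append(g)

u_A, _, _ = write_step(P, k_A)          # a colliding write at address A
cos_raw = float(k_A @ k_B)              # overlap of the raw key with k_B
cos_eff = float(u_A @ k_B / np.linalg.norm(u_A))  # overlap of the write direction
```

In this toy run the gain at B falls with each write, and the effective write direction for A overlaps with B much less than the raw key does, because the covariance has already contracted along B.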

**Figure 3.** Controlled collision recall. Two overlapping addresses $k_A$, $k_B$ (overlap $\rho$); B is boosted, then A is flooded, and $k_B$ is queried. (a) Pairwise retrieval at $k_B$ over time ($\rho = 0.92$). The dotted line at $0.731 = (1 + e^{-1})^{-1}$ is the boost asymptote (the pairwise softmax when the target outscores the distractor by 1). The Bayesian Layer holds this plateau through the flood; Linear Attention overwrites toward zero; Delt…
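The 0.731 plateau is a direct evaluation of the two-way softmax at a score margin of 1. A quick check in Python (the function name is only illustrative):

```python
import math

def pairwise_softmax(target_score, distractor_score):
    """Probability assigned to the target in a two-way softmax."""
    margin = target_score - distractor_score
    return 1.0 / (1.0 + math.exp(-margin))

plateau = pairwise_softmax(1.0, 0.0)   # target outscores the distractor by 1
even = pairwise_softmax(0.0, 0.0)      # equal scores give chance level
```

Only the margin matters here, so any pair of scores that differ by 1 gives the same plateau.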

**Figure 4.** MQAR with shared key–value vocabulary. (a) Bayesian Layer matches the best baselines on base MQAR. (b) On Update-MQAR, BL holds at ≈88% while GDN/Mamba-2 fall to ≈60%. (c) On Block-MQAR, conv-free BL and GDN bridge the key–value gap. Results. On base MQAR, the Bayesian Layer approaches near-perfect recall at larger widths and is competitive with the strongest baselines (Figure 4a). Results except those fo…

**Figure 5.** Bayesian Layer improves long-context retrieval. (a) Mean RULER NIAH accuracy gain of BL over GDN-zero by (context length, depth decile); green cells favor BL, pink cells favor GDN-zero. Depth 0.0 is earliest in context, 0.9 is latest. (b) Same delta broken out by NIAH subtask on a [−15, +15]-point scale. Gains concentrate on multi-target retrieval (multivalue, multiquery); the early-2k regression in (a) is…

**Figure 6.** Deterministic write-dynamics diagnostics. (left) Scalar write gain $g_t = \beta_t k_t^\top u_t$ over time for the deterministic experiment of Section 4.1 (a, b), verifying the closed-form gain envelope of Equation (42): a spike at the boost→flood transition followed by decay to the Riccati floor $g^{\mathrm{ss}} \approx 0.62$. (right) Deterministic overlap sweep: final retrieval at $\bar{k}_B$ as a function of $\rho$ at fixed write schedule. The Bayesi…
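A gain that spikes and then decays to a floor is the generic behaviour of a scalar Riccati recursion. The sketch below is an illustration only: it collapses the layer to one address with variance `p`, and the values `p0 = 10`, `q = 1`, `r = 1` are assumptions chosen for the demo, not settings from the paper. With these unit noise values the floor happens to be about 0.618; whether this matches the paper's Equation (42) setting cannot be confirmed from this summary.

```python
def scalar_gain_trajectory(p0, q, r, steps):
    """Gain sequence of a scalar Kalman filter written at every step."""
    p, gains = p0, []
    for _ in range(steps):
        p_bar = p + q                 # predict: process noise replenishes uncertainty
        g = p_bar / (p_bar + r)       # scalar write gain
        gains.append(g)
        p = (1.0 - g) * p_bar         # update: uncertainty contracts
    return gains

# Assumed demo values: large initial variance mimics a spike in gain.
gains = scalar_gain_trajectory(p0=10.0, q=1.0, r=1.0, steps=50)
floor = gains[-1]
```

The floor is nonzero only because `q > 0`; with no process noise the gain would keep shrinking toward zero.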

**Figure 7.** Random-key crowding. Cross-entropy at queries vs. key-space density $n/D$ under the random-key task above (no engineered target–distractor pairs). This stresses the cross-direction prediction $(1-\rho^2)\ell^2$ of Equation (16): a write at one address replenishes other addresses' uncertainty at this rate, which becomes a crowding cost as $n/D$ grows. The Bayesian Layer retains the lowest loss across all densities, wh…

**Figure 8.** Appendix Fig. A1. Position-stratified BL − GDN-FT RULER accuracy deltas for the matched-compute control. The panels mirror Figure 5B,C.

| context | model | niah_single_1 | niah_single_2 | niah_single_3 | niah_multikey_1 | niah_multivalue | niah_multiquery | mean |
|---|---|---|---|---|---|---|---|---|
| 2048 | GDN-zero | 100.00 | 96.50 | 30.50 | 28.00 | 19.25 | 22.88 | 49.52 |
| 2048 | GDN-FT | 100.00 | 92.00 | 27.00 | 25.00 | 23.50 | 23.25 | 48.46 |
| 2048 | BL | 100.00 | 89.00 | 42.00 | 27.50 | 22.88 | 23.50 | 50.81 |

4…

Original abstract

We introduce the design-model framework: a way to derive efficient recurrent sequence maps from explicit assumptions about memory. A design model writes evidence into memory by exact Bayesian filtering; a query-dependent readout produces a predictive distribution whose mean is the layer output. In our linear-Gaussian instantiation, the Bayesian Layer propagates both a mean and a covariance: the covariance tracks uncertainty over stored associations, steering writes toward uncertain directions, attenuating gains as evidence accumulates, and preserving confident memories. The same framework unifies several sub-quadratic recurrences. Linear attention, GLA, and Mamba-2/SSD are exact filters under one design model, whereas DeltaNet and related Delta-rule models arise as covariance-reset reductions under another. Restoring the covariance yields closed-form predictions for retrieval dynamics, verified empirically, and improves robustness beyond the training regime across controlled collision studies, learned associative recall, and the Zoology MQAR benchmark; distilling Bayesian Layers into a pretrained 340M Gated DeltaNet improves RULER long-context retrieval at matched compute.
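The phrase "attenuating gains as evidence accumulates" has a simple closed form in a stripped-down case. The derivation below is a sketch under assumptions not stated in the abstract: a single unit-norm key $k$ written repeatedly, observation-noise variance $r$, no process noise, and prior variance $p_0$ along $k$. The symbols $p_n$, $r$, and $g_n$ are introduced here for illustration only.

```latex
% Information form of the Kalman update: precision along k adds up across writes.
\begin{align}
  p_n^{-1} &= p_0^{-1} + \frac{n}{r}
  \quad\Longrightarrow\quad
  p_n = \frac{p_0\, r}{r + n\, p_0}, \\
  g_{n+1} &= \frac{p_n}{p_n + r} = \frac{p_0}{r + (n+1)\, p_0}.
\end{align}
% With p_0 = r = 1 the successive gains are 1/2, 1/3, 1/4, ...
```

In this special case the write gain at a repeatedly written address decays like $1/n$, so an association that is already well supported is barely moved by further evidence.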

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The design-model framework unifies several sequence layers as exact Bayesian filters under linear-Gaussian assumptions, with covariance propagation adding measurable robustness, though the exactness of the mapping needs direct verification.

The one or two things to know: the paper derives sub-quadratic sequence layers from explicit Bayesian memory assumptions via a design-model framework, showing linear attention, GLA, and Mamba-2/SSD as exact filters under one linear-Gaussian model and DeltaNet as a covariance-reset reduction under another; restoring covariance gives closed-form retrieval predictions that hold up in tests.

What is new is the explicit design model that starts from exact Bayesian filtering for writes and produces a query-dependent readout, plus the Bayesian Layer that propagates both mean and covariance to steer writes toward uncertain directions and preserve confident memories. The paper does well in the empirical section: controlled collision studies, learned associative recall, the Zoology MQAR benchmark, and a distillation experiment where Bayesian Layers improve RULER long-context retrieval in a 340M Gated DeltaNet at matched compute.

The soft spot is whether the linear-Gaussian mean and covariance propagation reproduces the original recurrences exactly, without hidden approximations, different noise models, or normalization steps. The unification claim rests on that mapping; if it holds without extra constraints, the framework is a clean derivation, but any mismatch turns it into a reinterpretation. This concern is worth checking against the equations side by side.

This paper is for researchers working on efficient long-sequence models and state-space variants. A reader focused on principled memory assumptions in attention or SSMs would get value from the framework and the robustness results. It deserves a serious referee because the new derivation path and the reproducible empirical claims are substantive enough to review.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the design-model framework for deriving recurrent sequence layers from explicit Bayesian filtering assumptions on memory writes, with a query-dependent readout. In the linear-Gaussian instantiation, the Bayesian Layer propagates both mean and covariance to track uncertainty over associations. It claims that Linear attention, GLA, and Mamba-2/SSD are exact filters under one design model while DeltaNet and related models arise as covariance-reset reductions under another; restoring the covariance yields closed-form retrieval predictions that are empirically verified and improve robustness on collision studies, associative recall, the Zoology MQAR benchmark, and via distillation into a 340M Gated DeltaNet on RULER.

Significance. If the exact equivalences hold without unstated constraints, the framework supplies a principled route to unify and extend sub-quadratic sequence models while generating falsifiable closed-form predictions for retrieval dynamics. The reported empirical gains on controlled robustness tests and the distillation result on long-context retrieval constitute concrete strengths that could be leveraged for model design.

major comments (2)

[Abstract and unification section] The claim that linear attention, GLA, and Mamba-2/SSD are 'exact filters' under the stated linear-Gaussian design model (and DeltaNet a covariance-reset reduction) is load-bearing for the unification and closed-form prediction results. The mean/covariance propagation must be shown to recover the original recurrences exactly, including any normalization, scaling, or noise-model details, without additional approximations or hidden constraints.
[Empirical evaluation sections] The robustness improvements 'beyond the training regime' across collision studies, learned associative recall, and MQAR must be supported by ablations that isolate the contribution of the restored covariance from other implementation choices, to substantiate that the design-model restoration is the operative factor.

minor comments (1)

Notation for the Bayesian Layer mean and covariance updates should be defined once with explicit variable lists to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the centrality of the exact equivalences and the need for isolating ablations. We address each major comment below and will revise the manuscript to strengthen clarity and evidence where indicated.

Referee: [Abstract and unification section] The claim that linear attention, GLA, and Mamba-2/SSD are 'exact filters' under the stated linear-Gaussian design model (and DeltaNet a covariance-reset reduction) is load-bearing for the unification and closed-form prediction results. The mean/covariance propagation must be shown to recover the original recurrences exactly, including any normalization, scaling, or noise-model details, without additional approximations or hidden constraints.

Authors: We agree that explicit verification of exact recovery is essential for the unification claims. Section 3 of the manuscript derives the Bayesian filter updates and shows that they recover the recurrences of Linear Attention, GLA, and Mamba-2/SSD exactly under the linear-Gaussian design model, with the specific scaling, normalization, and noise terms matching those in the original formulations; DeltaNet is obtained precisely by the covariance-reset reduction. To make this fully transparent, we will add a dedicated appendix containing the complete algebraic steps that recover each original recurrence from the mean/covariance propagation equations, confirming the absence of approximations or hidden constraints.
Revision: yes.
Referee: [Empirical evaluation sections] The robustness improvements 'beyond the training regime' across collision studies, learned associative recall, and MQAR must be supported by ablations that isolate the contribution of the restored covariance from other implementation choices, to substantiate that the design-model restoration is the operative factor.

Authors: We agree that isolating the covariance contribution is necessary to substantiate the claims. The manuscript already compares the full Bayesian Layer against its covariance-reset counterpart (the DeltaNet reduction) across the collision, associative recall, and MQAR experiments; these comparisons directly attribute gains to the covariance term. We will revise the empirical sections to present these as explicit ablations, add controls for other implementation factors (e.g., initialization and normalization variants), and move relevant supplementary results into the main text to strengthen the isolation of the design-model effect.
Revision: yes.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper derives recurrent sequence layers from explicit design-model assumptions (exact Bayesian filtering for writes, query-dependent readout) in a linear-Gaussian instantiation. Claims that Linear attention, GLA, Mamba-2/SSD are exact filters and DeltaNet is a covariance-reset reduction are presented as mathematical equivalences obtained by specializing the framework, not as fitted parameters renamed as predictions or self-definitions. No load-bearing self-citations, ansatz smuggling, or renaming of known results are exhibited in the provided text; the unification and closed-form predictions rest on the stated assumptions rather than reducing to inputs by construction. Empirical tests on retrieval benchmarks are independent of the unification step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Review limited to abstract; full details on parameters or additional assumptions unavailable. The framework rests on Bayesian filtering as the memory-write mechanism and linear-Gaussian statistics for the concrete layer.

axioms (2)

Domain assumption: Memory writes follow exact Bayesian filtering.
Central premise of the design-model framework stated in the abstract.
Domain assumption: Linear-Gaussian model suffices for the sequence-layer instantiation.
Used to derive the Bayesian Layer that propagates mean and covariance.

invented entities (1)

Bayesian Layer (no independent evidence)
Purpose: Sequence layer that maintains and uses covariance to steer writes and preserve confident memories.
Introduced as the linear-Gaussian realization of the design model.

pith-pipeline@v0.9.1-grok · 5713 in / 1466 out tokens · 33410 ms · 2026-06-28T21:05:17.720000+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 16 linked inside Pith

[1] Brian D. O. Anderson and John B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, N.J., 1979. ISBN 978-0-13-638122-8.
[2] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2024.
[3] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Ber, Emanuele Hu, Maor Ivgi, Gregor Lennartz, Kevin Schlegel, and Hochreiter Sepp. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.
[4] Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735, 2025.
[5] Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, and Emre Neftci. Learning to remember, learn, and forget in attention-based models. arXiv preprint arXiv:2602.09075, 2026.
[6] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, 2024a.
[7] Tri Dao and Albert Gu. Mamba-2: The structured state space duality. arXiv preprint arXiv:2405.21060, 2024b.
[8] Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Russ Harber, Tayfun Hazan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
[9] Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. MoM: Linear sequence modeling with mixture-of-memories. arXiv preprint arXiv:2502.13685, 2025.
[10] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv [cs.NE], October 2014. URL http://arxiv.org/abs/1410.5401.
[11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[12] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
[13] Thomas Kailath, Ali H Sayed, and Babak Hassibi. Linear estimation. Prentice Hall, 2000.
[14] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165, 2020. URL https://proceedings.mlr.press/v119/katharopoulos20a.html.
[15] Kimi Team. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.
[16] Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures. In International Conference on Machine Learning (ICML), 2025.
[17] Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden M. Gerber, Elad Dolev, Eran Krakovsky, Erez Safahi, Erez Schwartz, Gal Cohen, et al. Jamba: Hybrid transformer-Mamba language models. In International Confe...
[18] Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space models are amortized online learners. arXiv preprint arXiv:2407.14207, 2024a.
[19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.
[20] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
[21] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
[22] Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, et al. Eagle and finch: RWKV with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.
[23] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023.
[24] Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, et al. Mechanistic design and scaling of hybrid architectures. arXiv preprint arXiv:2403.17844, 2024.
[25] Zhen Qin, Songlin Li, Weixuan Sun, Xuyang Zhong, Dongxu Yang, Bowen Peng, Hao Zhong, et al. HGRN2: Gated linear RNNs with state expansion. arXiv preprint arXiv:2404.07904, 2024.
[26] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
[27] Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013. ISBN 9781107619289. URL http://www.worldcat.org/isbn/9781107619289.
[28] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. arXiv preprint arXiv:2102.11174, 2021.
[29] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
[30] Vaisakh Shaj, Cameron Barker, Aidan Scannell, Andras Szecsenyi, Elliot J. Crowley, and Amos Storkey. Kalman linear attention: Parallel bayesian filtering for efficient language modelling and state tracking. arXiv preprint arXiv:2602.10743, 2026.
[31] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. Deltaproduct: Increasing the expressivity of deltanet through products of householders. arXiv preprint arXiv:2502.10297, 2025.
[32] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R. Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. Cerebras Systems blog.
[33] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[35] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, and João Sacramento. MesaNet: Sequence modeling by locally optimal test-time train...
[36] Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao. The Mamba in the Llama: Distilling and accelerating hybrid models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[37] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
[38] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024a.
[39] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. Advances in Neural Information Processing Systems, 37:115491–115522, 2024b.

[1] [1]

Brian D. O. Anderson and John B. Moore. Optimal Filtering. Prentice-Hall , Englewood Cliffs, N.J. , 1979. ISBN 978-0-13-638122-8

1979

[2] [2]

Zoology: Measuring and improving recall in efficient language models

Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher R \'e . Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2024

arXiv 2024

[3] [3]

xLSTM : Extended long short-term memory

Maximilian Beck, Korbinian P \"o ppel, Markus Spanring, Andreas Ber, Emanuele Hu, Maor Ivgi, Gregor Lennartz, Kevin Schlegel, and Hochreiter Sepp. xLSTM : Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024

arXiv 2024

[4] [4]

Atlas: Learning to optimally memorize the context at test time

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time. arXiv preprint arXiv:2505.23735, 2025

arXiv 2025

[5] [5]

Learning to remember, learn, and forget in attention-based models

Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, and Emre Neftci. Learning to remember, learn, and forget in attention-based models. arXiv preprint arXiv:2602.09075, 2026

Pith/arXiv arXiv 2026

[6] [6]

Transformers are SSM s: Generalized models and efficient algorithms through structured state space duality

Tri Dao and Albert Gu. Transformers are SSM s: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, 2024 a

2024

[7] [7]

Mamba-2: The structured state space duality

Tri Dao and Albert Gu. Mamba-2: The structured state space duality. arXiv preprint arXiv:2405.21060, 2024 b

Pith/arXiv arXiv 2024

[8] [8]

Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Russ Harber, Tayfun Hazan, et al

Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Russ Harber, Tayfun Hazan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024

Pith/arXiv arXiv 2024

[9] [9]

MoM : Linear sequence modeling with mixture-of-memories

Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. MoM : Linear sequence modeling with mixture-of-memories. arXiv preprint arXiv:2502.13685, 2025

arXiv 2025

[10] [10]

Neural turing machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv [cs.NE], October 2014. URL http://arxiv.org/abs/1410.5401

[11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

[12] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

[13] Thomas Kailath, Ali H. Sayed, and Babak Hassibi. Linear estimation. Prentice Hall, 2000.

[14] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165, 2020. URL https://proceedings.mlr.press/v119/katharopoulos20a.html

[15] Kimi Team. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.

[16] Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures. In International Conference on Machine Learning (ICML), 2025.

[17] Barak Lenz, Opher Lieber, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, Daniel Gissin, Daniel Jannai, Dor Muhlgay, Dor Zimberg, Edden M. Gerber, Elad Dolev, Eran Krakovsky, Erez Safahi, Erez Schwartz, Gal Cohen, et al. Jamba: Hybrid transformer-Mamba language models. In International Confe...

[18] Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space models are amortized online learners. arXiv preprint arXiv:2407.14207, 2024a.

[19] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.

[20] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[21] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023.

[22] Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, et al. Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024.

[23] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR, 2023.

[24] Michael Poli, Armin W. Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, et al. Mechanistic design and scaling of hybrid architectures. arXiv preprint arXiv:2403.17844, 2024.

[25] Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. HGRN2: Gated linear RNNs with state expansion. arXiv preprint arXiv:2404.07904, 2024.

[26] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.

[27] Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013. ISBN 9781107619289. URL http://www.worldcat.org/isbn/9781107619289

[28] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. arXiv preprint arXiv:2102.11174, 2021.

[29] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

[30] Vaisakh Shaj, Cameron Barker, Aidan Scannell, Andras Szecsenyi, Elliot J. Crowley, and Amos Storkey. Kalman linear attention: Parallel Bayesian filtering for efficient language modelling and state tracking. arXiv preprint arXiv:2602.10743, 2026.

[31] Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. DeltaProduct: Increasing the expressivity of DeltaNet through products of Householders. arXiv preprint arXiv:2502.10297, 2025.

[32] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R. Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. Cerebras Systems blog.

[33] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.

[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

[35] Johannes von Oswald, Nino Scherrer, Seijin Kobayashi, Luca Versari, Songlin Yang, Maximilian Schlegel, Kaitlin Maile, Yanick Schimpf, Oliver Sieberling, Alexander Meulemans, Rif A. Saurous, Guillaume Lajoie, Charlotte Frenkel, Razvan Pascanu, Blaise Agüera y Arcas, and João Sacramento. MesaNet: Sequence modeling by locally optimal test-time train...

[36] Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao. The Mamba in the Llama: Distilling and accelerating hybrid models. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[37] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.

[38] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024a.

[39] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. Advances in Neural Information Processing Systems, 37:115491–115522, 2024b.
