pith. machine review for the scientific record.

arxiv: 2604.21100 · v1 · submitted 2026-04-22 · 💻 cs.LG

Recognition: unknown

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

Daniela Rus, Neehal Tumma, Noel Loo

Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords delta rule · preconditioning · linear recurrences · sequence modeling · DeltaNet · language modeling · curvature approximation · subquadratic attention

The pith

Preconditioning delta-rule recurrences accounts for loss curvature and improves sequence modeling performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that delta-rule recurrences approximate online least squares updates but ignore the curvature of that loss. Starting from the theory of online least squares, it derives that linear attention and the delta rule become equivalent under exact preconditioning. The authors then implement a practical diagonal approximation to the curvature matrix, creating preconditioned versions of DeltaNet, Gated DeltaNet, and Kimi Delta Attention together with chunkwise parallel algorithms for them. Experiments report consistent gains on synthetic recall benchmarks and on language modeling at 340M and 1B scales. A sympathetic reader sees this as addressing a missing optimization detail that has limited how well current subquadratic recurrences can learn long-range dependencies.
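
To make the mechanics concrete, here is a minimal NumPy sketch of the contrast, using the update forms quoted in the figure captions below (S_t = S_{t-1} + β_t (v_t − S_{t-1} k_t) k̃_t^⊤ with the diagonal curvature recurrence A_t = α_t A_{t-1} + β_t k_t ⊙ k_t). The function names and the epsilon floor are illustrative assumptions, not the authors' kernels.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """Plain delta rule: S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T."""
    return S + beta * np.outer(v - S @ k, k)

def preconditioned_delta_step(S, A, k, v, alpha, beta, eps=1e-8):
    """Diagonally preconditioned delta rule (sketch).

    A_t = alpha_t * A_{t-1} + beta_t * (k_t ⊙ k_t) accumulates the diagonal
    of the key Gram (curvature) matrix; the write key is the read key
    rescaled by the inverse diagonal curvature, k̃_t = k_t / A_t, giving
    S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k̃_t^T.
    """
    A = alpha * A + beta * k * k      # diagonal curvature recurrence
    k_write = k / (A + eps)           # preconditioned (distinct) write key
    S = S + beta * np.outer(v - S @ k, k_write)
    return S, A

# toy usage: stream a few (k, v) pairs through both updates
rng = np.random.default_rng(0)
S, Sp, A = np.zeros((4, 4)), np.zeros((4, 4)), np.zeros(4)
for _ in range(16):
    k, v = rng.normal(size=4), rng.normal(size=4)
    S = delta_rule_step(S, k, v, beta=0.5)
    Sp, A = preconditioned_delta_step(Sp, A, k, v, alpha=0.9, beta=0.5)
```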

Core claim

We derive equivalences between linear attention and the delta rule in the exactly preconditioned case. Our preconditioned delta-rule recurrences yield consistent performance improvements across synthetic recall benchmarks and language modeling at the 340M and 1B scale.

What carries the argument

A diagonal approximation to the curvature matrix of the online least-squares objective, which supplies preconditioned updates inside linear recurrence operators.
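
A hedged reconstruction of that chain in generic online least-squares notation (the symbols follow common usage and need not match the paper's):

```latex
% The recurrent state S_t is read as a running least-squares solution:
\[
  \mathcal{L}_t(S) \;=\; \tfrac{1}{2}\sum_{i \le t}\,\lVert S k_i - v_i \rVert^2 .
\]
% A single gradient step on the newest key-value pair recovers the delta rule:
\[
  S_t \;=\; S_{t-1} \;+\; \beta_t\,(v_t - S_{t-1}k_t)\,k_t^{\top} .
\]
% Preconditioning that step with the inverse key Gram (curvature) matrix
% H_t = \sum_{i \le t} k_i k_i^{\top} gives the exactly preconditioned case,
% in which the delta rule and linear attention coincide:
\[
  S_t \;=\; S_{t-1} \;+\; (v_t - S_{t-1}k_t)\,k_t^{\top} H_t^{-1} .
\]
% The practical variant keeps only the diagonal of the curvature,
% A_t = \alpha_t A_{t-1} + \beta_t\,(k_t \odot k_t),
% so the write key becomes \tilde{k}_t = k_t / A_t (elementwise).
```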

If this is right

  • Exact preconditioning makes linear attention and delta-rule updates mathematically equivalent.
  • Preconditioned DeltaNet, Gated DeltaNet, and Kimi Delta Attention can be computed with efficient chunkwise parallel algorithms.
  • The same preconditioned recurrences deliver measurable gains on both synthetic recall and real language modeling at 340M and 1B scales.
  • No extra per-task hyperparameter tuning is required to realize the reported improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same curvature-aware update could be inserted into other recurrence families that rely on online least-squares objectives.
  • If the diagonal approximation remains stable at larger scales, preconditioned recurrences may close more of the gap with full attention on long-context tasks.
  • Exploring low-rank or sparse curvature approximations beyond the diagonal case could trade a modest increase in compute for further accuracy gains.

Load-bearing premise

The diagonal approximation to the curvature matrix preserves the benefits of preconditioning without introducing instability or requiring additional hyperparameters that must be tuned per task.
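
The figure captions below spell out how the paper keeps this premise plausible: A_t is handled in log space (its distribution is roughly log-normal), squashed by f(r) = r/(1 + |r|), and mapped to a multiplicative factor B_t confined to [1/x, x] around 1, with x = 1.5. A minimal sketch under those quoted forms; how the centered log-curvature is scaled before squashing is an assumption of this sketch.

```python
import numpy as np

def squash(r):
    """Fast-moving inverted squash f(r) = r / (1 + |r|), mapping R into (-1, 1)."""
    return r / (1.0 + np.abs(r))

def bounded_factor(A, mu, x=1.5):
    """Map the roughly log-normal curvature A to a factor B in (1/x, x).

    Center log(A) at a learned mean mu, squash into (-1, 1), then set
    B = exp(-log(x) * s) = x**(-s): B equals 1 when s = 0 and stays in
    a band centered at 1, so the preconditioner can amplify or dampen
    but never explode.
    """
    s = squash(np.log(A) - mu)
    return np.exp(-np.log(x) * s)
```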

What would settle it

An experiment in which the preconditioned DeltaNet, GDN, or KDA variants show no improvement or degrade performance relative to their non-preconditioned baselines on the synthetic recall tasks or at the 340M/1B language-modeling scales.

Figures

Figures reproduced from arXiv: 2604.21100 by Daniela Rus, Neehal Tumma, Noel Loo.

Figure 1. Analysis of the key Gram matrix in an arbitrary layer of a pretrained DeltaNet-340M model from HuggingFace. Left: sorted eigenspectrum for various instances of the key Gram matrix. Right: eigenvalue-weighted average of the ℓ∞ norm of the key Gram eigenvectors; a value of 1 indicates perfect axis-alignment and a value of 1/√d indicates perfect misalignment.

Figure 2. Distributions of the diagonal Gram preconditioner before (left) and after (right) squashing, with x = 1.5 in B_t. The left plot uses log-spaced buckets, demonstrating the log-normality of A_t, which justifies the log-space parameterization in Eq. (9).

Figure 3. Throughput (tokens per second) during training on 8 H100s using DDP. Models are 340M parameters (D = 1024, H = 8, 24 layers).

Figure 4. Execution time of kernels for varying sequence lengths (batch size 1, 8 heads, head dimension 128). "Without P_t" excludes the preconditioner recurrence from the computation; scalar and diagonal decay variants are shown. The PDN recurrence S_t = S_{t-1} + β_t (v_t − S_{t-1} k_t) k̃_t^⊤ uses a distinct read key and write key.

Figure 5. Results on the synthetic MQAR task.

Figure 7. Distribution of the learned k_t^⊤ k̃_t term, which modulates the write eigenvalue in PGDN (averaged over layers).

Figure 6. 340M models trained at 2K context length, evaluated on the S-NIAH benchmark from RULER (Hsieh et al., 2024b).

Figure 8. Pseudocode for the forward pass of the GDN vs. PGDN chunkwise parallel kernels; the PKDA forward pass is analogously adapted from the KDA forward pass.

Figure 9. Pseudocode for the backward pass of the GDN vs. PGDN chunkwise parallel kernels; the PKDA backward pass is analogously adapted from the KDA backward pass.

Figure 10. Additional results on the synthetic MQAR task, with sequence length fixed at 1024 and a varying number of KV pairs; preconditioned recurrences either maintain or improve performance.

Figure 11. Learned centers µ across various layers and heads in the PGDN model.

Figure 12. Left: the fast-moving inverted squash function f(r) = r/(1 + |r|) used in the preconditioned recurrences, alongside the slower-moving inverted tanh, which performed worse empirically. Right: average saturation of each squash function across the sequence. The final term in the transform on A_t is B_t = exp(−log(x) · s_t).

Figure 13. Learned distribution of B_t across a few layers in the PGDN model; the squash interval [1/x, x] is centered around 1, avoiding a bias toward only amplifying or only dampening dynamics.

Figure 14. Impact of various normalization factors on the learned decay and write terms in the main PGDN recurrence, where m_t = k^⊤ B_t k.

Figure 15. Distribution of learned k̃_t^⊤ k_t in PGDN, showing that it deviates from 1, indicative of increased recurrent expressivity relative to GDN.

Figure 16. Learned α_t, β_t, and eigenvalues in the GDN and PGDN recurrences. Appendix F derives the chunkwise parallel form for the general preconditioner recurrence A_t = α_t A_{t-1} + β_t (k_t ⊙ k_t) with learnable decay α_t and gain β_t.
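
Figures 8, 9, and 16 reference chunkwise parallel forms, including one derived for the general preconditioner recurrence A_t = α_t A_{t-1} + β_t (k_t ⊙ k_t). Below is a minimal NumPy sketch of a chunkwise evaluation of that diagonal recurrence alone, not the authors' fused Triton kernels; the within-chunk cumulative-product trick is standard, and numerical safeguards (such as the paper's log-space parameterization) are omitted.

```python
import numpy as np

def chunkwise_diag_preconditioner(alpha, beta, k, A0, chunk=64):
    """Chunkwise evaluation of A_t = alpha_t * A_{t-1} + beta_t * (k_t ⊙ k_t).

    alpha, beta: (T,) per-step decay and gain; k: (T, d) keys; A0: (d,)
    initial diagonal state. Within a chunk, the running product of decays
    P[t] carries the chunk's initial state forward and discounts each
    local contribution, so all positions in the chunk are computed at
    once and the time loop moves in strides of `chunk`. Real kernels
    would do this in log space to avoid underflow in the products.
    """
    T, d = k.shape
    A_all = np.empty((T, d))
    A = A0
    for s in range(0, T, chunk):
        e = min(s + chunk, T)
        P = np.cumprod(alpha[s:e])                        # (c,) decay products
        contrib = (beta[s:e, None] * k[s:e] ** 2) / P[:, None]
        # A_t = P_t * (A_start + sum_{j<=t} beta_j k_j^2 / P_j)
        A_chunk = P[:, None] * (A[None, :] + np.cumsum(contrib, axis=0))
        A_all[s:e] = A_chunk
        A = A_chunk[-1]                                   # carry to next chunk
    return A_all
```
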
original abstract

To address the increasing long-context compute limitations of softmax attention, several subquadratic recurrent operators have been developed. This work includes models such as Mamba-2, DeltaNet, Gated DeltaNet (GDN), and Kimi Delta Attention (KDA). As the space of recurrences grows, a parallel line of work has arisen to taxonomize them. One compelling view is the test-time regression (TTR) framework, which interprets recurrences as performing online least squares updates that learn a linear map from the keys to values. Existing delta-rule recurrences can be seen as first-order approximations to this objective, but notably ignore the curvature of the least-squares loss during optimization. In this work, we address this by introducing preconditioning to these recurrences. Starting from the theory of online least squares, we derive equivalences between linear attention and the delta rule in the exactly preconditioned case. Next, we realize this theory in practice by proposing a diagonal approximation: this enables us to introduce preconditioned variants of DeltaNet, GDN, and KDA alongside efficient chunkwise parallel algorithms for computing them. Empirically, we find that our preconditioned delta-rule recurrences yield consistent performance improvements across synthetic recall benchmarks and language modeling at the 340M and 1B scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes preconditioned variants of DeltaNet, Gated DeltaNet (GDN), and Kimi Delta Attention (KDA) by incorporating curvature information from online least-squares regression. It derives equivalences between linear attention and the delta rule under exact preconditioning, introduces a diagonal approximation to the curvature matrix to enable efficient chunkwise parallel algorithms, and reports consistent empirical gains on synthetic recall benchmarks as well as language modeling at the 340M and 1B scales.

Significance. If the diagonal approximation to the curvature matrix can be shown to preserve stable curvature-aware updates without hidden per-task tuning or divergence, the work would offer a principled unification of linear attention with preconditioned delta-rule recurrences and a practical route to improving subquadratic sequence models. The derivation from online least-squares theory and the scale of the reported experiments (340M/1B) are genuine strengths, and the contribution would be solidified if the approximation's validity were established.

major comments (2)
  1. [Theoretical derivation (as summarized in the abstract)] The central derivation establishes equivalences between linear attention and the delta rule only in the exactly preconditioned case (using the full inverse curvature matrix). The practical algorithm replaces this with a diagonal approximation to enable chunkwise parallelism, but no argument or bound is given showing that the diagonal version retains the curvature-aware update direction or the claimed equivalence properties. This is load-bearing for the central claim because the reported gains at 340M and 1B scales rest entirely on the unproven stability and benefit of the approximation.
  2. [Empirical evaluation section] The abstract states that the diagonal approximation 'enables' preconditioned variants with 'consistent performance improvements,' yet the manuscript provides no analysis or ablation demonstrating that the approximation avoids instability or additional hyperparameter retuning across tasks. Without such evidence, the empirical results cannot be taken as confirmation that preconditioning benefits are preserved at scale.
minor comments (2)
  1. Clarify the precise definition and initialization of the curvature matrix early in the paper to make the transition from exact preconditioning to the diagonal case easier to follow.
  2. The synthetic recall benchmarks would benefit from reporting variance across multiple random seeds to strengthen the claim of consistent improvements.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.

point-by-point responses
  1. Referee: [Theoretical derivation (as summarized in the abstract)] The central derivation establishes equivalences between linear attention and the delta rule only in the exactly preconditioned case (using the full inverse curvature matrix). The practical algorithm replaces this with a diagonal approximation to enable chunkwise parallelism, but no argument or bound is given showing that the diagonal version retains the curvature-aware update direction or the claimed equivalence properties. This is load-bearing for the central claim because the reported gains at 340M and 1B scales rest entirely on the unproven stability and benefit of the approximation.

    Authors: We agree that the formal equivalence between linear attention and the delta rule is established only under exact preconditioning with the full inverse curvature matrix. The diagonal approximation is introduced specifically to enable efficient chunkwise parallel computation while still incorporating an approximation to the curvature. In the revised manuscript, we will expand the theoretical discussion to provide a clearer argument—based on the online least-squares regression framework—explaining why the diagonal elements capture the dominant curvature directions and why the resulting updates remain beneficial in practice. We will also add small-scale synthetic experiments comparing update directions under exact vs. diagonal preconditioning. A rigorous error bound on the approximation, however, would require substantial new theoretical analysis. revision: partial

  2. Referee: [Empirical evaluation section] The abstract states that the diagonal approximation 'enables' preconditioned variants with 'consistent performance improvements,' yet the manuscript provides no analysis or ablation demonstrating that the approximation avoids instability or additional hyperparameter retuning across tasks. Without such evidence, the empirical results cannot be taken as confirmation that preconditioning benefits are preserved at scale.

    Authors: We acknowledge that the current manuscript lacks explicit ablations on stability and hyperparameter sensitivity. In the revised version, we will add a dedicated empirical analysis subsection that includes: training loss and gradient norm curves for the 340M-scale preconditioned models to demonstrate absence of instability or divergence; explicit confirmation that all reported results used identical hyperparameters to the baseline DeltaNet/GDN/KDA models without per-task retuning; and additional ablations on synthetic recall tasks varying the diagonal approximation strength. These additions will provide direct evidence that the preconditioning benefits are preserved without hidden tuning. revision: yes

standing simulated objections (not resolved)
  • A formal mathematical bound proving that the diagonal curvature approximation retains the exact equivalence properties or provides stability guarantees equivalent to the full-matrix case.

Circularity Check

0 steps flagged

Derivation from established online least squares theory shows no self-referential reduction

full rationale

The paper begins its central derivation from the established test-time regression (TTR) framework interpreting recurrences as online least squares updates. It derives equivalences between linear attention and the delta rule specifically in the exactly preconditioned case using this external theory. The diagonal approximation is introduced as a practical realization for chunkwise parallelism, without any claim that it follows by construction or reduces to fitted parameters. No self-citations are load-bearing for the uniqueness or ansatz, and no predictions are statistically forced by inputs. The empirical improvements are presented as validation rather than derived results. This grounds the derivation against external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the test-time regression interpretation of recurrences and introduces a diagonal approximation whose effectiveness is asserted rather than derived from first principles.

axioms (2)
  • domain assumption Recurrent operators perform online least-squares updates that learn a linear map from keys to values
    This is the TTR framework invoked to reinterpret existing delta rules.
  • ad hoc to paper A diagonal approximation to the curvature matrix is sufficient for practical gains
    Introduced to obtain efficient algorithms while retaining preconditioning benefits.

pith-pipeline@v0.9.0 · 5531 in / 1215 out tokens · 38123 ms · 2026-05-10T00:13:49.730092+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.

Reference graph

Works this paper leans on

136 extracted references · 49 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1] Amini, A., Banaszak, A., Benoit, H., Böök, A., Dakhran, T., Duong, S., Eng, A., Fernandes, F., Härkönen, M., Harrington, A., Hasani, R., Karwa, S., Khrustalev, Y., Labonne, M., Lechner, M., Lechner, V., Lee, S., Li, Z., Loo, N., Marks, J., Mosca, E., Paech, S. J., Pak, P., Parnichkun, R. N., Quach, A., Rogers, R., Rus, D., Saxena, N., Schlager, B., Seyde, et al. LFM2 technical report. arXiv preprint arXiv:2511.23404, 2025.
  2. [3] Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., and Ré, C. Zoology: Measuring and improving recall in efficient language models. In The Twelfth International Conference on Learning Representations, 2024a. https://openreview.net/forum?id=LY3ukUANko
  3. [4] Arora, S., Eyuboglu, S., Zhang, M., Timalsina, A., Alberti, S., Zou, J., Rudra, A., and Ré, C. Simple linear attention language models balance the recall-throughput tradeoff. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024b. https://openreview.net/forum?id=qRlcoPhEoD
  4. [5] Arora, S., Timalsina, A., Singhal, A., Spector, B., Eyuboglu, S., Zhao, X., Rao, A., Rudra, A., and Ré, C. Just read twice: Closing the recall gap for recurrent language models. In Proceedings of the 2nd Efficient Systems for Foundation Models Workshop at ICML, 2024c.
  5. [6] Behrouz, A., Zhong, P., and Mirrokni, V. Titans: Learning to memorize at test time, 2024. arXiv:2501.00663.
  6. [7] Behrouz, A., Li, Z., Kacham, P., Daliri, M., Deng, Y., Zhong, P., Razaviyayn, M., and Mirrokni, V. Atlas: Learning to optimally memorize the context at test time, 2025. arXiv:2505.23735.
  7. [10] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
  8. [11] Dao, T. and Gu, A. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
  9. [13] Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
  10. [14] Grazzi, R., Siems, J., Zela, A., Franke, J. K. H., Hutter, F., and Pontil, M. Unlocking state-tracking in linear RNNs through negative eigenvalues, 2025. arXiv:2411.12537.
  11. [15] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.
  12. [16] Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022a.
  13. [17] Gu, A., Gupta, A., Goel, K., and Ré, C. On the parameterization and initialization of diagonal state space models. In Advances in Neural Information Processing Systems (NeurIPS), 2022b.
  14. [18] Guo, H., Yang, S., Goel, T., Xing, E. P., Dao, T., and Kim, Y. Log-linear attention, 2025. arXiv:2506.04761.
  15. [19] Hasani, R., Lechner, M., Wang, T.-H., Chahine, M., Amini, A., and Rus, D. Liquid structural state-space models. In The Eleventh International Conference on Learning Representations, 2023.
  16. [21] Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., Zhang, Y., and Ginsburg, B. RULER: What's the real context size of your long-context language models?, 2024b. arXiv:2404.06654.
  17. [22] Hu, J., Pan, Y., Du, J., Lan, D., Tang, X., Wen, Q., Liang, Y., and Sun, W. Comba: Improving bilinear RNNs with closed-loop control, 2025. arXiv:2506.02475.
  18. [25] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
  19. [26] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
  20. [29] Liu, B., Wang, R., Wu, L., Feng, Y., Stone, P., and Liu, Q. Longhorn: State space models are amortized online learners, 2024. arXiv:2407.14207.
  21. [31] Loo, N., Swaroop, S., and Turner, R. E. Generalized variational continual learning, 2020. arXiv:2011.12328.
  22. [32] Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature, 2020. arXiv:1503.05671.
  23. [33] Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.
  24. [34] Merrill, W., Petty, J., and Sabharwal, A. The illusion of state in state-space models, 2025. arXiv:2404.08819.
  25. [35] Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k²). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983.
  26. [36] Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational continual learning, 2018. arXiv:1710.10628.
  27. [38] Parnichkun, R. N., Tumma, N., Thomas, A. W., Moro, A., An, Q., Suzuki, T., Yamashita, A., Poli, M., and Massaroli, S. Quantifying memory utilization with effective state-size, 2025. arXiv:2504.19561.
  28. [39] Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., Lin, J., Liu, J., Lu, J., Merrill, W., Song, G., Tan, K., Utpala, S., Wilce, N., Wind, J. S., Wu, T., Wuttke, D., and Zhou-Zheng, C. RWKV-7 "Goose" with expressive dynamic state evolution, 2025a. arXiv:2503.14456.
  29. [40] Peng, L., Chattopadhyay, A., Zancato, L., Nunez, E., Xia, W., and Soatto, S. Gated KalmaNet: A fading memory layer through test-time ridge regression, 2025b. arXiv:2511.21016.
  30. [41] Poli, M., Massaroli, S., Nguyen, E., Fu, D. Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., and Ré, C. Hyena hierarchy: Towards larger convolutional language models, 2023. arXiv:2302.10866.
  31. [44] Robbins, H. E. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
  32. [46] Schlag, I., Irie, K., and Schmidhuber, J. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning (ICML), 2021. arXiv:2102.11174.
  33. [47] Siems, J., Carstensen, T., Zela, A., Hutter, F., Pontil, M., and Grazzi, R. DeltaProduct: Improving state-tracking in linear RNNs via Householder products, 2025. arXiv:2502.10297.
  34. [48] Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023.
  35. [49] Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, June 2023. https://huggingface.co/datasets/cerebras/SlimPajama-627B
  36. [50] Sun, Y., Li, X., Dalal, K., Hsu, C., Koyejo, S., Guestrin, C., Wang, X., Hashimoto, T., and Chen, X. Learning to (learn at test time), 2024. arXiv:2310.13807.
  37. [51] Kimi Team: Zhang, Y., Lin, Z., Yao, X., Hu, J., Meng, F., Liu, C., Men, X., Yang, S., Li, Z., Li, W., Lu, E., Liu, W., Chen, Y., Xu, W., Yu, L., Wang, Y., Fan, Y., Zhong, L., Yuan, E., Zhang, D., Zhang, Y., Liu, T. Y., Wang, H., Fang, S., He, W., Liu, S., Li, Y., Su, J., Qiu, J., Pang, B., Yan, J., Jiang, Z., Huang, W., Yin, B., You, J., Wei, C., Wang, Z., et al.
  38. [52] Tumma, N., Lechner, M., Loo, N., Hasani, R., and Rus, D. Leveraging low-rank and sparse recurrent connectivity for robust closed-loop control. In The Twelfth International Conference on Learning Representations, 2024.
  39. [53] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  40. [54] von Oswald, J., Schlegel, M., Meulemans, A., Kobayashi, S., Niklasson, E., Zucchet, N., Scherrer, N., Miller, N., Sandler, M., y Arcas, B. A., Vladymyrov, M., Pascanu, R., and Sacramento, J. Uncovering mesa-optimization algorithms in transformers, 2024. arXiv:2309.05858.
  41. [55] von Oswald, J., Scherrer, N., Kobayashi, S., Versari, L., Yang, S., Schlegel, M., Maile, K., Schimpf, Y., Sieberling, O., Meulemans, A., Saurous, R. A., Lajoie, G., Frenkel, C., Pascanu, R., y Arcas, B. A., and Sacramento, J. MesaNet: Sequence modeling by locally optimal test-time training, 2025. arXiv:2506.05233.
  42. [56] Wang, K. A., Shi, J., and Fox, E. B. Test-time regression: A unifying framework for designing sequence models with associative memory, 2025. arXiv:2501.12352.
  43. [57] Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity, 2020. arXiv:2006.04768.
  44. [59] Yang, S. and Zhang, Y. FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. https://github.com/fla-org/flash-linear-attention
  45. [60] Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. Gated linear attention transformers with hardware-efficient training. In Forty-first International Conference on Machine Learning, 2023.
  46. [61] Yang, S., Wang, B., Zhang, Y., Shen, Y., and Kim, Y. Parallelizing linear transformers with the delta rule over sequence length. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  47. [62] Yang, S., Kautz, J., and Hatamizadeh, A. Gated Delta Networks: Improving Mamba2 with delta rule, 2025. arXiv:2412.06464.
  48. [64] Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
  49. [65] PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/aaai.v34i05.6239
  50. [67] WinoGrande: An adversarial Winograd Schema Challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/aaai.v34i05.6399
  51. [68] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
  52. [69] Sap, M., Rashkin, H., Chen, D., Le Bras, R., and Choi, Y. Social IQa: Commonsense reasoning about social interactions. In EMNLP-IJCNLP, 2019. doi:10.18653/v1/D19-1454
  53. [70] Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT, 2019. doi:10.18653/v1/N19-1300
  54. [71] Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.
  55. [72] Paperno, D., Kruszewski, G., et al. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016. doi:10.18653/v1/P16-1144
  56. [74] Just read twice: Closing the recall gap for recurrent language models. In Proceedings of the 2nd Efficient Systems for Foundation Models Workshop at ICML, 2024.
  57. [75] Lockard, C., Shiralkar, P., and Dong, X. L. In NAACL-HLT, 2019. doi:10.18653/v1/N19-1309
  58. [76] Language models enable simple systems for generating structured views of heterogeneous data lakes. Proceedings of the VLDB Endowment, 2023. doi:10.14778/3626292.3626294
  59. [77] Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for SQuAD. In ACL, 2018. doi:10.18653/v1/P18-2124
  60. [78] Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147
  61. [79] Gated KalmaNet: A fading memory layer through test-time ridge regression, 2025.
  62. [80] Linformer: Self-attention with linear complexity, 2020.
  63. [81] Welbl, J., Liu, N. F., and Gardner, M. Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 2017. doi:10.18653/v1/W17-4413
  64. [83] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
  65. [84] LFM2 technical report, 2025.
  66. [85] Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL-HLT, 2019. doi:10.18653/v1/N19-1246
  67. [86] Lost in the middle: How language models use long contexts; Natural Questions: A benchmark for question answering research, Transactions of the Association for Computational Linguistics, doi:10.1162/tacl_a_00276.
  68. [87] Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations.
  69. [88] Simplified state space layers for sequence modeling. In The Eleventh International Conference on Learning Representations.
  70. [89] Attention is all you need. Advances in Neural Information Processing Systems.
  71. [90] Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, June 2023.
  72. [91] FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024.
  73. [92] Gated linear attention transformers with hardware-efficient training. In Forty-first International Conference on Machine Learning.
  74. [93] Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, 2023.
  75. [94] Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling.
  76. [95] Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning.
  77. [96] Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality.
  78. [97] Zoology: Measuring and improving recall in efficient language models. In The Twelfth International Conference on Learning Representations.
  79. [98] Mechanistic design and scaling of hybrid architectures. In Forty-first International Conference on Machine Learning.
  80. [99] Liquid structural state-space models. In The Eleventh International Conference on Learning Representations.

Showing first 80 references.