Recognition: unknown
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3
The pith
Preconditioning delta-rule recurrences accounts for loss curvature and improves sequence modeling performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive equivalences between linear attention and the delta rule in the exactly preconditioned case. Our preconditioned delta-rule recurrences yield consistent performance improvements across synthetic recall benchmarks and language modeling at the 340M and 1B scale.
What carries the argument
A diagonal approximation to the curvature matrix of the online least-squares objective, which supplies preconditioned updates inside linear recurrence operators.
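To make the mechanism concrete, here is a minimal sketch of a plain delta-rule state update next to a hypothetical diagonally preconditioned variant. The accumulator h, the Adagrad-style rescaling by 1/(h + eps), and the placement of the preconditioner inside the outer product are illustrative assumptions, not the paper's actual parameterization.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """Plain delta rule: one gradient step on 0.5 * ||S k - v||^2."""
    err = v - S @ k                       # prediction error for this token
    return S + beta * np.outer(err, k)

def diag_precond_delta_step(S, h, k, v, beta, eps=1e-6):
    """Hypothetical diagonally preconditioned delta rule.

    h accumulates the diagonal of the key second-moment (curvature) matrix
    sum_i k_i k_i^T; the key direction is rescaled by 1/(h + eps), in the
    spirit of Adagrad. The paper's exact parameterization may differ.
    """
    h = h + k * k                         # running diagonal curvature estimate
    err = v - S @ k
    return S + beta * np.outer(err, k / (h + eps)), h

# toy stream of (key, value) pairs with d_k = d_v = 4
rng = np.random.default_rng(0)
S_plain, S_pre, h = np.zeros((4, 4)), np.zeros((4, 4)), np.zeros(4)
for _ in range(8):
    k, v = rng.normal(size=4), rng.normal(size=4)
    S_plain = delta_rule_step(S_plain, k, v, beta=0.5)
    S_pre, h = diag_precond_delta_step(S_pre, h, k, v, beta=0.5)
```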
If this is right
- Exact preconditioning makes linear attention and delta-rule updates mathematically equivalent.
- Preconditioned DeltaNet, Gated DeltaNet, and Kimi Delta Attention can be computed with efficient chunkwise parallel algorithms (a sketch of the chunkwise pattern appears after this list).
- The same preconditioned recurrences deliver measurable gains on both synthetic recall and real language modeling at 340M and 1B scales.
- No extra per-task hyperparameter tuning is required to realize the reported improvements.
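On the chunkwise point, the chunking pattern itself is easiest to show on the simplest member of the family, the plain linear-attention recurrence; the delta-rule and preconditioned variants need the paper's more involved algorithms, so the following is only a sketch of the idea, with shapes and names chosen for illustration.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=16):
    """Chunkwise evaluation of the plain linear-attention recurrence
    S_t = S_{t-1} + v_t k_t^T,  o_t = S_t q_t.
    Intra-chunk terms use a causally masked matmul; the running state S is
    carried across chunks. Shapes: Q, K are (T, d_k); V is (T, d_v)."""
    T, d_v = V.shape
    S = np.zeros((d_v, K.shape[1]))
    out = np.zeros((T, d_v))
    for s in range(0, T, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        inter = q @ S.T                  # contribution of earlier chunks
        A = np.tril(q @ k.T)             # causal intra-chunk scores
        out[s:s+chunk] = inter + A @ v
        S = S + v.T @ k                  # fold this chunk into the state
    return out

# sanity check against the token-by-token recurrence
rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(64, 8)),
           rng.normal(size=(64, 8)),
           rng.normal(size=(64, 4)))
S, ref = np.zeros((4, 8)), np.zeros((64, 4))
for t in range(64):
    S = S + np.outer(V[t], K[t])
    ref[t] = S @ Q[t]
assert np.allclose(chunkwise_linear_attention(Q, K, V), ref)
```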
Where Pith is reading between the lines
- The same curvature-aware update could be inserted into other recurrence families that rely on online least-squares objectives.
- If the diagonal approximation remains stable at larger scales, preconditioned recurrences may close more of the gap with full attention on long-context tasks.
- Exploring low-rank or sparse curvature approximations beyond the diagonal case could trade a modest increase in compute for further accuracy gains.
Load-bearing premise
The diagonal approximation to the curvature matrix preserves the benefits of preconditioning without introducing instability or requiring additional hyperparameters that must be tuned per task.
What would settle it
An experiment in which the preconditioned DeltaNet, GDN, or KDA variants show no improvement or degrade performance relative to their non-preconditioned baselines on the synthetic recall tasks or at the 340M/1B language-modeling scales.
Original abstract
To address the increasing long-context compute limitations of softmax attention, several subquadratic recurrent operators have been developed. This work includes models such as Mamba-2, DeltaNet, Gated DeltaNet (GDN), and Kimi Delta Attention (KDA). As the space of recurrences grows, a parallel line of work has arisen to taxonomize them. One compelling view is the test-time regression (TTR) framework, which interprets recurrences as performing online least squares updates that learn a linear map from the keys to values. Existing delta-rule recurrences can be seen as first-order approximations to this objective, but notably ignore the curvature of the least-squares loss during optimization. In this work, we address this by introducing preconditioning to these recurrences. Starting from the theory of online least squares, we derive equivalences between linear attention and the delta rule in the exactly preconditioned case. Next, we realize this theory in practice by proposing a diagonal approximation: this enables us to introduce preconditioned variants of DeltaNet, GDN, and KDA alongside efficient chunkwise parallel algorithms for computing them. Empirically, we find that our preconditioned delta-rule recurrences yield consistent performance improvements across synthetic recall benchmarks and language modeling at the 340M and 1B scale.
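Read through the TTR lens, the "exactly preconditioned" claim can be illustrated with a few lines of linear algebra: composing the linear-attention state with the full inverse key curvature (Gram) matrix recovers the least-squares map from keys to values. The batch (non-streaming) setting and the small ridge term lam below are simplifications for the sketch, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, d_v, T = 6, 3, 64
K = rng.normal(size=(d_k, T))        # keys as columns
V = rng.normal(size=(d_v, T))        # values as columns
lam = 1e-8                           # tiny ridge term for numerical stability

# Unpreconditioned accumulation (the linear-attention state): sum_t v_t k_t^T
S_lin = V @ K.T

# Exact preconditioner: inverse of the key curvature (Gram) matrix
P = np.linalg.inv(K @ K.T + lam * np.eye(d_k))

# Preconditioned state = least-squares map from keys to values
W = S_lin @ P

# Check against the batch solution of  min_W ||W K - V||^2
W_lstsq = np.linalg.lstsq(K.T, V.T, rcond=None)[0].T
assert np.allclose(W, W_lstsq, atol=1e-6)
```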
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes preconditioned variants of DeltaNet, Gated DeltaNet (GDN), and Kimi Delta Attention (KDA) by incorporating curvature information from online least-squares regression. It derives equivalences between linear attention and the delta rule under exact preconditioning, introduces a diagonal approximation to the curvature matrix to enable efficient chunkwise parallel algorithms, and reports consistent empirical gains on synthetic recall benchmarks as well as language modeling at the 340M and 1B scales.
Significance. If the diagonal approximation to the curvature matrix can be shown to preserve stable curvature-aware updates without hidden per-task tuning or divergence, the work would offer a principled unification of linear attention with preconditioned delta-rule recurrences and a practical route to improving subquadratic sequence models. The derivation from online least-squares theory and the scale of the reported experiments (340M/1B) are strengths, though they bolster the contribution only if the approximation's validity is established.
major comments (2)
- [Theoretical derivation (as summarized in the abstract)] The central derivation establishes equivalences between linear attention and the delta rule only in the exactly preconditioned case (using the full inverse curvature matrix). The practical algorithm replaces this with a diagonal approximation to enable chunkwise parallelism, but no argument or bound is given showing that the diagonal version retains the curvature-aware update direction or the claimed equivalence properties. This is load-bearing for the central claim because the reported gains at 340M and 1B scales rest entirely on the unproven stability and benefit of the approximation.
- [Empirical evaluation section] The abstract states that the diagonal approximation 'enables' preconditioned variants with 'consistent performance improvements,' yet the manuscript provides no analysis or ablation demonstrating that the approximation avoids instability or additional hyperparameter retuning across tasks. Without such evidence, the empirical results cannot be taken as confirmation that preconditioning benefits are preserved at scale.
minor comments (2)
- Clarify the precise definition and initialization of the curvature matrix early in the paper to make the transition from exact preconditioning to the diagonal case easier to follow.
- The synthetic recall benchmarks would benefit from reporting variance across multiple random seeds to strengthen the claim of consistent improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.
Point-by-point responses
- Referee: [Theoretical derivation (as summarized in the abstract)] The central derivation establishes equivalences between linear attention and the delta rule only in the exactly preconditioned case (using the full inverse curvature matrix). The practical algorithm replaces this with a diagonal approximation to enable chunkwise parallelism, but no argument or bound is given showing that the diagonal version retains the curvature-aware update direction or the claimed equivalence properties. This is load-bearing for the central claim because the reported gains at 340M and 1B scales rest entirely on the unproven stability and benefit of the approximation.
Authors: We agree that the formal equivalence between linear attention and the delta rule is established only under exact preconditioning with the full inverse curvature matrix. The diagonal approximation is introduced specifically to enable efficient chunkwise parallel computation while still incorporating an approximation to the curvature. In the revised manuscript, we will expand the theoretical discussion to provide a clearer argument, based on the online least-squares regression framework, explaining why the diagonal elements capture the dominant curvature directions and why the resulting updates remain beneficial in practice. We will also add small-scale synthetic experiments comparing update directions under exact vs. diagonal preconditioning (a sketch of such a comparison appears after these responses). A rigorous error bound on the approximation, however, would require substantial new theoretical analysis. revision: partial
- Referee: [Empirical evaluation section] The abstract states that the diagonal approximation 'enables' preconditioned variants with 'consistent performance improvements,' yet the manuscript provides no analysis or ablation demonstrating that the approximation avoids instability or additional hyperparameter retuning across tasks. Without such evidence, the empirical results cannot be taken as confirmation that preconditioning benefits are preserved at scale.
Authors: We acknowledge that the current manuscript lacks explicit ablations on stability and hyperparameter sensitivity. In the revised version, we will add a dedicated empirical analysis subsection that includes: training loss and gradient norm curves for the 340M-scale preconditioned models to demonstrate absence of instability or divergence; explicit confirmation that all reported results used identical hyperparameters to the baseline DeltaNet/GDN/KDA models without per-task retuning; and additional ablations on synthetic recall tasks varying the diagonal approximation strength. These additions will provide direct evidence that the preconditioning benefits are preserved without hidden tuning. revision: yes
- Left open by the rebuttal: a formal mathematical bound proving that the diagonal curvature approximation retains the exact equivalence properties or provides stability guarantees equivalent to the full-matrix case.
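The small-scale comparison proposed in the first author response could look like the sketch below: compare the update direction obtained with a full curvature estimate against the one obtained from its diagonal, for example via cosine similarity. The placement of the preconditioner and the curvature estimate H_full are assumptions made for illustration, not the paper's definitions.

```python
import numpy as np

def update_direction(S, H, k, v):
    """Preconditioned delta-rule direction: apply H^{-1} to the key inside
    the error outer product (one plausible placement; an assumption)."""
    err = v - S @ k
    return np.outer(err, np.linalg.solve(H, k))

rng = np.random.default_rng(2)
d = 8
S = 0.1 * rng.normal(size=(d, d))
K = rng.normal(size=(d, 32))
H_full = K @ K.T + 1e-3 * np.eye(d)        # full curvature estimate from keys
H_diag = np.diag(np.diag(H_full))          # its diagonal approximation

k, v = rng.normal(size=d), rng.normal(size=d)
g_full = update_direction(S, H_full, k, v).ravel()
g_diag = update_direction(S, H_diag, k, v).ravel()

cos = g_full @ g_diag / (np.linalg.norm(g_full) * np.linalg.norm(g_diag))
print(f"cosine similarity between exact and diagonal directions: {cos:.3f}")
```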
Circularity Check
Derivation from established online least squares theory shows no self-referential reduction
Full rationale
The paper begins its central derivation from the established test-time regression (TTR) framework, which interprets recurrences as online least-squares updates. It derives equivalences between linear attention and the delta rule specifically in the exactly preconditioned case using this external theory. The diagonal approximation is introduced as a practical realization for chunkwise parallelism, without any claim that it follows by construction or reduces to fitted parameters. No self-citations are load-bearing for the uniqueness or ansatz, and no predictions are statistically forced by inputs. The empirical improvements are presented as validation rather than derived results. The derivation therefore rests on external theory and is checked against external benchmarks rather than against itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Recurrent operators perform online least-squares updates that learn a linear map from keys to values.
- ad hoc to paper: A diagonal approximation to the curvature matrix is sufficient for practical gains.
Forward citations
Cited by 1 Pith paper
- Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators
  Spectral Koopman operators let SSMs achieve 100% accuracy on long-gap multi-query associative recall with fixed memory, where pure Mamba fails.