Recognition: 2 theorem links · Lean Theorem
Kimi Linear: An Expressive, Efficient Attention Architecture
Pith reviewed 2026-05-13 23:42 UTC · model grok-4.3
The pith
Kimi Linear, a hybrid linear attention model, outperforms full attention across contexts while cutting KV cache by up to 75%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across short-context, long-context, and reinforcement learning scaling regimes. Its core is Kimi Delta Attention, an expressive linear module that extends Gated DeltaNet with finer-grained gating to use finite-state RNN memory more effectively. A specialized chunkwise algorithm based on a Diagonal-Plus-Low-Rank transition matrix variant keeps computation low. A model with 3B activated parameters and 48B total parameters, built from a layerwise mix of KDA and Multi-Head Latent Attention, exceeds full MLA performance while reducing KV cache usage by up to 75% and achieving up to 6x decoding throughput at 1M context.
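To make the mechanism concrete, the sketch below shows a gated delta-rule recurrence of the kind KDA builds on, in NumPy: a fixed-size state matrix stands in for a growing KV cache, and a per-channel gate plays the role of the finer-grained gating (Gated DeltaNet would use a single scalar there). The shapes, names, and token-by-token loop are illustrative assumptions; the paper's actual algorithm is a chunkwise kernel built on a specialized DPLR transition, not this loop.

```python
import numpy as np

def delta_rule_step(S, k, v, beta, gate):
    """One recurrent step of a gated delta-rule linear-attention layer.

    S    : (d_k, d_v) finite-state memory matrix (fixed size, unlike a KV cache)
    k    : (d_k,) key for this token
    v    : (d_v,) value for this token
    beta : scalar write strength in [0, 1]
    gate : (d_k,) per-channel forget gate in [0, 1]; the "finer-grained" gating.
           A scalar gate here would recover Gated DeltaNet-style decay.
    """
    S = gate[:, None] * S                  # fine-grained forgetting per key channel
    S = S - beta * np.outer(k, k @ S)      # delta rule: erase the old content at k
    return S + beta * np.outer(k, v)       # write the new (k, v) association

def linear_attention(qs, ks, vs, betas, gates):
    """Run the recurrence over a sequence and read the state with each query."""
    d_k, d_v = qs.shape[1], vs.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for q, k, v, b, g in zip(qs, ks, vs, betas, gates):
        S = delta_rule_step(S, k, v, b, g)
        outs.append(q @ S)                 # output is a read of the fixed-size state
    return np.stack(outs)
```

Because the state has fixed size, per-token decoding cost does not grow with context length, which is where the KV-cache and long-context throughput claims come from.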
What carries the argument
Kimi Delta Attention (KDA) module with finer-grained gating on top of Gated DeltaNet, combined with a specialized Diagonal-Plus-Low-Rank (DPLR) transition matrix for efficient chunkwise computation.
Load-bearing premise
The performance gains come from the finer-grained gating in KDA and the specialized DPLR variant rather than from differences in training data, hyperparameters, or evaluation setup.
What would settle it
Train identical Kimi Linear and full attention models on the exact same data and hyperparameters, then measure whether the linear version still scores higher on the reported benchmarks.
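A minimal sketch of that controlled setup, with hypothetical field names and values (nothing here is the paper's actual recipe): two training configurations that are identical except for the attention module, so any remaining benchmark gap can be attributed to the architecture.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    # Everything below is held fixed across the two runs.
    seed: int = 0
    tokens: int = 1_000_000_000          # hypothetical training budget
    lr: float = 2e-4
    batch_size: int = 256
    data_mix: str = "pretrain_mix_v1"    # hypothetical data-mix identifier
    # The single degree of freedom under test.
    attention: str = "mla_full"

def paired_configs() -> tuple[TrainConfig, TrainConfig]:
    """Return two configs that differ only in the attention module."""
    baseline = TrainConfig()
    candidate = replace(baseline, attention="kda_mla_hybrid")
    return baseline, candidate

if __name__ == "__main__":
    base, cand = paired_configs()
    # Sanity check: swapping the attention field back makes the configs identical.
    assert base == replace(cand, attention=base.attention)
```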
Original abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kimi Linear, a hybrid linear attention architecture whose core is Kimi Delta Attention (KDA), an extension of Gated DeltaNet that adds finer-grained gating to improve utilization of finite-state RNN memory. It employs a bespoke chunkwise algorithm based on a specialized Diagonal-Plus-Low-Rank (DPLR) transition matrix that reduces computation relative to the general DPLR form while staying closer to the classical delta rule. A 3B-activated / 48B-total-parameter model is pretrained as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). The central empirical claim is that, under an identical training recipe, Kimi Linear outperforms full MLA across short-context, long-context, and RL scaling regimes, while cutting KV cache usage by up to 75% and delivering up to 6x decoding throughput at 1M context. The KDA kernel, vLLM integration, and model checkpoints are released.
Significance. If the performance margins survive rigorous verification of identical training conditions, the result would be significant: it would demonstrate that a carefully designed linear attention module can surpass full attention in both accuracy and efficiency across multiple regimes, offering a practical drop-in replacement that materially reduces inference cost for long contexts. The open-sourcing of the kernel and models further strengthens the contribution by enabling direct reproduction and extension.
major comments (2)
- Abstract and Experiments section: the claim that gains arise under an 'identical training recipe' is load-bearing for attributing improvements to KDA's gating and the specialized DPLR variant. The manuscript does not supply side-by-side hyperparameter tables, exact data-mix details, or ablations that swap only the attention module while freezing all other factors; without these, the reported margins (and 75% KV-cache reduction) could stem from unstated differences in optimization or evaluation rather than the architecture.
- §3.2 (KDA and DPLR formulation): the specialized DPLR variant is asserted to be both more efficient and more consistent with the classical delta rule than the general DPLR, yet no direct complexity comparison, flop counts, or numerical-stability analysis versus the general formulation is provided to support this design choice.
minor comments (2)
- Figure captions and axis labels in the throughput and KV-cache plots should explicitly state the context lengths and batch sizes used so readers can directly compare the 1M-context 6x claim.
- Notation for the gating variables in the KDA equations should be unified across the text and pseudocode to avoid ambiguity between per-head and per-dimension gates.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and analyses.
Point-by-point responses
-
Referee: Abstract and Experiments section: the claim that gains arise under an 'identical training recipe' is load-bearing for attributing improvements to KDA's gating and the specialized DPLR variant. The manuscript does not supply side-by-side hyperparameter tables, exact data-mix details, or ablations that swap only the attention module while freezing all other factors; without these, the reported margins (and 75% KV-cache reduction) could stem from unstated differences in optimization or evaluation rather than the architecture.
Authors: We appreciate this point and agree that explicit documentation is necessary to support the attribution of gains to the architecture. The training runs for Kimi Linear and the full MLA baseline used identical data mixtures, optimizer settings, learning-rate schedules, batch sizes, and all other hyperparameters. In the revised manuscript we will add: (1) side-by-side hyperparameter tables, (2) precise data-mix specifications, and (3) an ablation study that replaces only the attention module while freezing every other training factor. These additions will make the identical-recipe claim fully verifiable and will confirm that the observed margins arise from KDA rather than extraneous differences. revision: yes
-
Referee: §3.2 (KDA and DPLR formulation): the specialized DPLR variant is asserted to be both more efficient and more consistent with the classical delta rule than the general DPLR, yet no direct complexity comparison, flop counts, or numerical-stability analysis versus the general formulation is provided to support this design choice.
Authors: We agree that quantitative support for the design choice would strengthen the paper. In the revision we will expand §3.2 (or add an appendix) with: (i) asymptotic and practical flop-count comparisons, (ii) explicit complexity analysis of the specialized versus general DPLR transition matrices, and (iii) numerical-stability experiments (including forward-pass error accumulation and gradient-norm statistics) on both formulations. These results will demonstrate the computational savings and closer fidelity to the classical delta rule that motivated the specialized variant. revision: yes
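A minimal sketch of the forward-pass error-accumulation check promised in (iii), under assumed shapes and a plain token-by-token gated delta-rule update rather than the paper's chunkwise kernel: run the same recurrence in float64 as a reference and in float32, and track the relative error of the state over the sequence.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, gate):
    """One gated delta-rule update on state S of shape (d_k, d_v)."""
    S = gate[:, None] * S
    S = S - beta * np.outer(k, k @ S)
    return S + beta * np.outer(k, v)

def error_accumulation(T=4096, d_k=64, d_v=64, seed=0):
    """Relative error of a float32 recurrence against a float64 reference."""
    rng = np.random.default_rng(seed)
    ks = rng.standard_normal((T, d_k))
    ks /= np.linalg.norm(ks, axis=-1, keepdims=True)   # unit-norm keys
    vs = rng.standard_normal((T, d_v))
    betas = rng.uniform(0.0, 1.0, T)
    gates = rng.uniform(0.9, 1.0, (T, d_k))            # per-channel decay in [0.9, 1)

    S64 = np.zeros((d_k, d_v), dtype=np.float64)
    S32 = np.zeros((d_k, d_v), dtype=np.float32)
    errs = []
    for t in range(T):
        S64 = gated_delta_step(S64, ks[t], vs[t], betas[t], gates[t])
        S32 = gated_delta_step(S32.astype(np.float32),
                               ks[t].astype(np.float32),
                               vs[t].astype(np.float32),
                               np.float32(betas[t]),
                               gates[t].astype(np.float32))
        errs.append(np.linalg.norm(S32 - S64) / (np.linalg.norm(S64) + 1e-12))
    return np.array(errs)

if __name__ == "__main__":
    e = error_accumulation()
    print(f"median relative error {np.median(e):.2e}, final {e[-1]:.2e}")
```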
Circularity Check
No significant circularity; empirical claims rest on new architecture and reported benchmarks
full rationale
The paper presents an empirical architecture paper whose central claim is that a hybrid of KDA (finer-grained gating extension of Gated DeltaNet) and MLA outperforms full attention under an identical training recipe, with measured KV-cache and throughput gains. No load-bearing mathematical derivation, prediction, or uniqueness theorem is offered that reduces by construction to fitted inputs or prior self-citations. The architecture description introduces new components (chunkwise DPLR variant, gating mechanism) whose performance is asserted via experimental results rather than self-referential definitions or renamed known patterns. Self-citations, if present for Gated DeltaNet, are not load-bearing for the outperformance claim.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "outperforms full attention under fair comparisons... reducing KV cache usage by up to 75%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
δ-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
-
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
-
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.