Recognition: 2 theorem links · Lean Theorem
Kimi Linear: An Expressive, Efficient Attention Architecture
Pith reviewed 2026-05-13 23:42 UTC · model grok-4.3
The pith
Kimi Linear, a hybrid linear attention model, outperforms full attention across contexts while cutting KV cache by up to 75%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across short-context, long-context, and reinforcement learning scaling regimes. Its core is Kimi Delta Attention, an expressive linear module that extends Gated DeltaNet with finer-grained gating to use finite-state RNN memory more effectively. A specialized chunkwise algorithm based on a Diagonal-Plus-Low-Rank transition matrix variant keeps computation low. A model with 3B activated parameters and 48B total parameters, built from a layerwise mix of KDA and Multi-Head Latent Attention, exceeds full MLA performance while reducing KV cache usage by up to 75% and achieving up to 6x decoding throughput at 1M context.
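To make the mechanism concrete, the sketch below shows a gated delta-rule recurrence of the kind KDA builds on, in NumPy: a fixed-size state matrix stands in for a growing KV cache, and a per-channel gate plays the role of the finer-grained gating (Gated DeltaNet would use a single scalar there). The shapes, names, and token-by-token loop are illustrative assumptions; the paper's actual algorithm is a chunkwise kernel built on a specialized DPLR transition, not this loop.

```python
import numpy as np

def delta_rule_step(S, k, v, beta, gate):
    """One recurrent step of a gated delta-rule linear-attention layer.

    S    : (d_k, d_v) finite-state memory matrix (fixed size, unlike a KV cache)
    k    : (d_k,) key for this token
    v    : (d_v,) value for this token
    beta : scalar write strength in [0, 1]
    gate : (d_k,) per-channel forget gate in [0, 1]; the "finer-grained" gating.
           A scalar gate here would recover Gated DeltaNet-style decay.
    """
    S = gate[:, None] * S                  # fine-grained forgetting per key channel
    S = S - beta * np.outer(k, k @ S)      # delta rule: erase the old content at k
    return S + beta * np.outer(k, v)       # write the new (k, v) association

def linear_attention(qs, ks, vs, betas, gates):
    """Run the recurrence over a sequence and read the state with each query."""
    d_k, d_v = qs.shape[1], vs.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for q, k, v, b, g in zip(qs, ks, vs, betas, gates):
        S = delta_rule_step(S, k, v, b, g)
        outs.append(q @ S)                 # output is a read of the fixed-size state
    return np.stack(outs)
```

Because the state has fixed size, per-token decoding cost does not grow with context length, which is where the KV-cache and long-context throughput claims come from.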
What carries the argument
Kimi Delta Attention (KDA) module with finer-grained gating on top of Gated DeltaNet, combined with a specialized Diagonal-Plus-Low-Rank (DPLR) transition matrix for efficient chunkwise computation.
Load-bearing premise
The performance gains come from the finer-grained gating in KDA and the specialized DPLR variant rather than from differences in training data, hyperparameters, or evaluation setup.
What would settle it
Train identical Kimi Linear and full attention models on the exact same data and hyperparameters, then measure whether the linear version still scores higher on the reported benchmarks.
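A minimal sketch of that controlled setup, with hypothetical field names and values (nothing here is the paper's actual recipe): two training configurations that are identical except for the attention module, so any remaining benchmark gap can be attributed to the architecture.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    # Everything below is held fixed across the two runs.
    seed: int = 0
    tokens: int = 1_000_000_000          # hypothetical training budget
    lr: float = 2e-4
    batch_size: int = 256
    data_mix: str = "pretrain_mix_v1"    # hypothetical data-mix identifier
    # The single degree of freedom under test.
    attention: str = "mla_full"

def paired_configs() -> tuple[TrainConfig, TrainConfig]:
    """Return two configs that differ only in the attention module."""
    baseline = TrainConfig()
    candidate = replace(baseline, attention="kda_mla_hybrid")
    return baseline, candidate

if __name__ == "__main__":
    base, cand = paired_configs()
    # Sanity check: swapping the attention field back makes the configs identical.
    assert base == replace(cand, attention=base.attention)
```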
Original abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kimi Linear, a hybrid linear attention architecture whose core is Kimi Delta Attention (KDA), an extension of Gated DeltaNet that adds finer-grained gating to improve utilization of finite-state RNN memory. It employs a bespoke chunkwise algorithm based on a specialized Diagonal-Plus-Low-Rank (DPLR) transition matrix that reduces computation relative to the general DPLR form while staying closer to the classical delta rule. A 3B-activated / 48B-total-parameter model is pretrained as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). The central empirical claim is that, under an identical training recipe, Kimi Linear outperforms full MLA across short-context, long-context, and RL scaling regimes, while cutting KV cache usage by up to 75% and delivering up to 6x decoding throughput at 1M context. The KDA kernel, vLLM integration, and model checkpoints are released.
Significance. If the performance margins survive rigorous verification of identical training conditions, the result would be significant: it would demonstrate that a carefully designed linear attention module can surpass full attention in both accuracy and efficiency across multiple regimes, offering a practical drop-in replacement that materially reduces inference cost for long contexts. The open-sourcing of the kernel and models further strengthens the contribution by enabling direct reproduction and extension.
major comments (2)
- Abstract and Experiments section: the claim that gains arise under an 'identical training recipe' is load-bearing for attributing improvements to KDA's gating and the specialized DPLR variant. The manuscript does not supply side-by-side hyperparameter tables, exact data-mix details, or ablations that swap only the attention module while freezing all other factors; without these, the reported margins (and 75% KV-cache reduction) could stem from unstated differences in optimization or evaluation rather than the architecture.
- §3.2 (KDA and DPLR formulation): the specialized DPLR variant is asserted to be both more efficient and more consistent with the classical delta rule than the general DPLR, yet no direct complexity comparison, flop counts, or numerical-stability analysis versus the general formulation is provided to support this design choice.
minor comments (2)
- Figure captions and axis labels in the throughput and KV-cache plots should explicitly state the context lengths and batch sizes used so readers can directly compare the 1M-context 6x claim.
- Notation for the gating variables in the KDA equations should be unified across the text and pseudocode to avoid ambiguity between per-head and per-dimension gates.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and analyses.
Point-by-point responses
-
Referee: Abstract and Experiments section: the claim that gains arise under an 'identical training recipe' is load-bearing for attributing improvements to KDA's gating and the specialized DPLR variant. The manuscript does not supply side-by-side hyperparameter tables, exact data-mix details, or ablations that swap only the attention module while freezing all other factors; without these, the reported margins (and 75% KV-cache reduction) could stem from unstated differences in optimization or evaluation rather than the architecture.
Authors: We appreciate this point and agree that explicit documentation is necessary to support the attribution of gains to the architecture. The training runs for Kimi Linear and the full MLA baseline used identical data mixtures, optimizer settings, learning-rate schedules, batch sizes, and all other hyperparameters. In the revised manuscript we will add: (1) side-by-side hyperparameter tables, (2) precise data-mix specifications, and (3) an ablation study that replaces only the attention module while freezing every other training factor. These additions will make the identical-recipe claim fully verifiable and will confirm that the observed margins arise from KDA rather than extraneous differences. revision: yes
-
Referee: §3.2 (KDA and DPLR formulation): the specialized DPLR variant is asserted to be both more efficient and more consistent with the classical delta rule than the general DPLR, yet no direct complexity comparison, flop counts, or numerical-stability analysis versus the general formulation is provided to support this design choice.
Authors: We agree that quantitative support for the design choice would strengthen the paper. In the revision we will expand §3.2 (or add an appendix) with: (i) asymptotic and practical flop-count comparisons, (ii) explicit complexity analysis of the specialized versus general DPLR transition matrices, and (iii) numerical-stability experiments (including forward-pass error accumulation and gradient-norm statistics) on both formulations. These results will demonstrate the computational savings and closer fidelity to the classical delta rule that motivated the specialized variant. revision: yes
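A minimal sketch of the forward-pass error-accumulation check promised in (iii), under assumed shapes and a plain token-by-token gated delta-rule update rather than the paper's chunkwise kernel: run the same recurrence in float64 as a reference and in float32, and track the relative error of the state over the sequence.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, gate):
    """One gated delta-rule update on state S of shape (d_k, d_v)."""
    S = gate[:, None] * S
    S = S - beta * np.outer(k, k @ S)
    return S + beta * np.outer(k, v)

def error_accumulation(T=4096, d_k=64, d_v=64, seed=0):
    """Relative error of a float32 recurrence against a float64 reference."""
    rng = np.random.default_rng(seed)
    ks = rng.standard_normal((T, d_k))
    ks /= np.linalg.norm(ks, axis=-1, keepdims=True)   # unit-norm keys
    vs = rng.standard_normal((T, d_v))
    betas = rng.uniform(0.0, 1.0, T)
    gates = rng.uniform(0.9, 1.0, (T, d_k))            # per-channel decay in [0.9, 1)

    S64 = np.zeros((d_k, d_v), dtype=np.float64)
    S32 = np.zeros((d_k, d_v), dtype=np.float32)
    errs = []
    for t in range(T):
        S64 = gated_delta_step(S64, ks[t], vs[t], betas[t], gates[t])
        S32 = gated_delta_step(S32.astype(np.float32),
                               ks[t].astype(np.float32),
                               vs[t].astype(np.float32),
                               np.float32(betas[t]),
                               gates[t].astype(np.float32))
        errs.append(np.linalg.norm(S32 - S64) / (np.linalg.norm(S64) + 1e-12))
    return np.array(errs)

if __name__ == "__main__":
    e = error_accumulation()
    print(f"median relative error {np.median(e):.2e}, final {e[-1]:.2e}")
```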
Circularity Check
No significant circularity; empirical claims rest on new architecture and reported benchmarks
full rationale
The paper presents an empirical architecture paper whose central claim is that a hybrid of KDA (finer-grained gating extension of Gated DeltaNet) and MLA outperforms full attention under an identical training recipe, with measured KV-cache and throughput gains. No load-bearing mathematical derivation, prediction, or uniqueness theorem is offered that reduces by construction to fitted inputs or prior self-citations. The architecture description introduces new components (chunkwise DPLR variant, gating mechanism) whose performance is asserted via experimental results rather than self-referential definitions or renamed known patterns. Self-citations, if present for Gated DeltaNet, are not load-bearing for the outperformance claim.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "outperforms full attention under fair comparisons... reducing KV cache usage by up to 75%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
-
OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
OSDN adds online diagonal preconditioning to the Delta Rule, preserving chunkwise parallelism while proving super-geometric convergence and delivering 32-39% recall gains at 340M-1.3B scales.
-
δ-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
-
Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
-
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
PrfaaS enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention models via selective offloading, bandwidth-aware scheduling, and cache-aware placement, yielding 54% higher throughput and 6...
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
-
LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...
-
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.