
arxiv: 2605.18753 · v1 · pith:GYO6TJXKnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.LG

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Pith reviewed 2026-05-20 10:45 UTC · model grok-4.3

keywords sparse attention · hierarchical attention · differentiable sparsity · long context modeling · LLM efficiency · alpha-entmax · adaptive block selection · non-dispersive attention

The pith

DashAttention replaces fixed top-k with adaptive α-entmax to make hierarchical attention fully differentiable and non-dispersive for long contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a hierarchical sparse attention mechanism for large language models that selects key-value blocks adaptively rather than with a rigid top-k operation. Current approaches break gradient flow between coarse and fine stages and assume a fixed number of relevant tokens per query. DashAttention applies an α-entmax transformation in the first stage to choose a variable number of blocks according to the query, then uses that selection as a prior for the second-stage softmax. This keeps the whole process end-to-end differentiable and avoids the dispersion that limits other sparse methods. If the approach holds, it would let models handle long sequences at high sparsity levels such as 75 percent while matching full-attention accuracy and offering faster inference through an efficient implementation.

Core claim

DashAttention leverages the adaptively sparse α-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models show that DashAttention achieves accuracy comparable to full attention at 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes, along with an efficient GPU-aware Triton implementation that achieves a speedup over FlashAttention-3 at inference time.

What carries the argument

The adaptively sparse α-entmax transformation that selects a variable number of relevant KV blocks in the coarse stage and supplies a differentiable prior for fine-grained softmax attention.
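The two-stage idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: block summaries are mean-pooled keys (the paper builds them with local SDPA), the coarse stage uses sparsemax, which is the α = 2 special case of α-entmax, and the prior is assumed to multiply the unnormalized fine-stage softmax weights. The names `sparsemax` and `two_stage_attention` are hypothetical.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: alpha-entmax with alpha = 2, computed in closed form by sorting."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum   # true exactly for the first k_max entries
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold that makes the output sum to 1
    return np.maximum(z - tau, 0.0)

def two_stage_attention(q, K, V, block_size):
    """Coarse sparse block selection followed by fine softmax with a block prior."""
    n, d = K.shape
    n_blocks = n // block_size
    # Stage 0 (assumption): summarize each block by mean-pooling its keys.
    summaries = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Stage 1: sparse distribution over blocks; the support size depends on the query.
    prior = sparsemax(summaries @ q / np.sqrt(d))
    selected = np.flatnonzero(prior > 0)
    # Stage 2 (assumption): softmax over tokens of selected blocks, weighted by the prior.
    token_idx = (selected[:, None] * block_size + np.arange(block_size)).ravel()
    logits = K[token_idx] @ q / np.sqrt(d)
    weights = np.repeat(prior[selected], block_size) * np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[token_idx], selected

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
out, selected = two_stage_attention(q, K, V, block_size=4)
print("selected blocks:", selected, "output shape:", out.shape)
```

Unlike top-k, the number of selected blocks is not an input here: it falls out of the coarse scores, and the prior values stay in the computation graph, so gradients can reach the coarse stage.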

If this is right

  • LLMs reach accuracy comparable to full attention at 75 percent sparsity, that is, while computing roughly a quarter of the fine-grained attention entries.
  • Gradient flow remains intact between coarse block selection and fine token attention, supporting stable end-to-end training.
  • The non-dispersive property improves long-sequence modeling compared with prior hierarchical sparse methods.
  • A superior accuracy-efficiency trade-off appears especially in high-sparsity regimes versus NSA and InfLLMv2.
  • The Triton implementation runs faster than FlashAttention-3 at inference time.
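The arithmetic behind the 75 percent figure can be made concrete with a small cost count. This is a back-of-the-envelope sketch: it assumes that sparsity means the fraction of key tokens skipped in the fine stage and that the coarse stage scores one summary per block. The block size of 64 is a hypothetical value, and `keys_scored_per_query` is an illustrative name.

```python
def keys_scored_per_query(n_keys, sparsity, block_size):
    """Return (coarse, fine) counts of query-key scores for one query."""
    coarse = n_keys // block_size              # one score per block summary
    fine = round((1.0 - sparsity) * n_keys)    # tokens kept after block selection
    return coarse, fine

coarse, fine = keys_scored_per_query(n_keys=16384, sparsity=0.75, block_size=64)
total = coarse + fine
print(coarse, fine, total / 16384)  # prints 256 4096 0.265625
```

Under these assumptions the coarse stage adds a small overhead, so the total work per query is slightly above 25 percent of full attention.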

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The variable block selection could be extended to allocate context resources differently across layers or tasks.
  • Similar adaptive sparsity could be tested in other transformer components such as feed-forward layers.
  • Combining the method with existing long-context techniques might push feasible context lengths further on limited hardware.
  • Measuring performance on tasks that require precise recall of distant information would test where the non-dispersive property matters most.

Load-bearing premise

The adaptively sparse α-entmax transformation reliably selects relevant blocks according to the query and supplies an effective prior for the second stage without introducing dispersion or training instability.

What would settle it

Training or evaluating DashAttention on long-context tasks and finding higher attention dispersion or convergence failure relative to full attention would challenge the non-dispersive and stable claims.
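A toy experiment shows what dispersion looks like. It is a sketch under simplifying assumptions, not the paper's evaluation: one relevant key has a fixed logit of 2.0, all other keys have logit 0, and sparsemax (α-entmax with α = 2) stands in for the sparse transformation. All function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: alpha-entmax with alpha = 2, computed in closed form by sorting."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    k_max = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max
    return np.maximum(z - tau, 0.0)

def top_weight(transform, n, relevant_logit=2.0):
    """Weight given to the single relevant key among n keys."""
    logits = np.zeros(n)
    logits[0] = relevant_logit
    return float(transform(logits)[0])

lengths = (16, 256, 4096)
soft = [top_weight(softmax, n) for n in lengths]
sparse = [top_weight(sparsemax, n) for n in lengths]
for n, s, p in zip(lengths, soft, sparse):
    print(f"n={n:5d}  softmax={s:.4f}  sparsemax={p:.4f}")
```

With a bounded logit gap, the softmax weight on the relevant key shrinks as the context grows, while the sparse transformation keeps it fixed in this toy setting. A real test would track the same quantity in a trained model across context lengths.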

Figures

Figures reproduced from arXiv: 2605.18753 by André F. T. Martins, Edoardo M. Ponti, Federico Alvetreti, Lei Li, Marcos V. Treviso, Nuno M. T. Gonçalves, Xu Han, Yuxiang Huang.

Figure 1: High-level overview of DashAttention. Stage 0 builds chunk summaries by local SDPA; ...
Figure 2: Accuracy–Sparsity Pareto frontiers on HELMET. For the 8B model, we sweep the sparsity (γ for DashAttention, k for NSA and InfLLMv2) of each sparse method to obtain points at increasing sparsity levels and report the resulting HELMET overall accuracy. DashAttention dominates NSA and InfLLMv2 across the sweep (...
Figure 3: Per-layer attention sparsity. Since DashAttention uses α-entmax in Stage 1 to induce sparsity, it can dynamically allocate sparsity across different layers according to the geometry of the Stage 1 scores. To show this property, we measure the sparsity of each layer using a 16K-length input from RULER-SG1. The analysis is shown in ...
Figure 4: Visualization of mappings for different values of ...
Figure 5: Accuracy–Sparsity Pareto frontiers on HELMET at 16K context length.
Original abstract

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $\alpha$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DashAttention, a hierarchical attention mechanism for LLMs that replaces fixed top-k block selection with an adaptively sparse α-entmax transformation in the coarse stage. This produces a query-dependent variable support over KV blocks that serves as a prior for the second-stage softmax attention, rendering the full hierarchy differentiable. The authors claim the resulting attention is non-dispersive (unlike NSA and InfLLMv2), yields comparable accuracy to full attention at 75% sparsity, improves the accuracy-sparsity Pareto frontier especially at high sparsity, and admits an efficient Triton GPU implementation that outperforms FlashAttention-3 at inference.

Significance. If the non-dispersive property and empirical gains are substantiated, the method would offer a practical route to scaling long-context modeling without sacrificing differentiability or introducing excessive dispersion. The explicit GPU-aware implementation and focus on high-sparsity regimes constitute concrete engineering contributions that could be adopted in production LLM inference stacks.

major comments (2)
  1. [Experiments] Experiments section: aggregate accuracy and Pareto curves are reported, yet no per-layer entropy, support-size histograms, or second-stage attention concentration statistics are shown. Without these, it is impossible to verify that the α-entmax prior actually keeps the fine-grained distribution non-dispersive when block relevance is diffuse, which is load-bearing for the central claim of superiority over NSA/InfLLMv2 in high-sparsity regimes.
  2. [§3.2] §3.2 (Method): the argument that α-entmax supplies a sufficiently tight prior rests on the assumption that selected blocks contain predominantly relevant tokens. When query-key similarity is low or α is not layer-specific, the variable support can still admit many marginally relevant tokens, allowing the subsequent softmax to spread; the manuscript provides no ablation isolating this mechanism from the overall accuracy numbers.
minor comments (2)
  1. [Abstract] The abstract and introduction introduce 'non-dispersive' without a quantitative definition or reference to a specific entropy or support-size metric; a short formal definition should appear before the experimental claims.
  2. [§3] Notation for the two-stage hierarchy (coarse α-entmax output as prior) is introduced but not consistently reused in the complexity analysis; a single equation summarizing end-to-end complexity would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental validation and the tightness of the proposed prior. We address each major comment below and outline planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: aggregate accuracy and Pareto curves are reported, yet no per-layer entropy, support-size histograms, or second-stage attention concentration statistics are shown. Without these, it is impossible to verify that the α-entmax prior actually keeps the fine-grained distribution non-dispersive when block relevance is diffuse, which is load-bearing for the central claim of superiority over NSA/InfLLMv2 in high-sparsity regimes.

    Authors: We agree that additional statistics are needed to directly substantiate the non-dispersive property. In the revised manuscript we will add per-layer entropy plots, histograms of the number of selected KV blocks per query, and second-stage attention concentration metrics (such as effective support size and entropy of the fine-grained distribution). These will be placed in the Experiments section to demonstrate that the α-entmax prior maintains concentration even under diffuse block relevance. revision: yes

  2. Referee: [§3.2] §3.2 (Method): the argument that α-entmax supplies a sufficiently tight prior rests on the assumption that selected blocks contain predominantly relevant tokens. When query-key similarity is low or α is not layer-specific, the variable support can still admit many marginally relevant tokens, allowing the subsequent softmax to spread; the manuscript provides no ablation isolating this mechanism from the overall accuracy numbers.

    Authors: The superior high-sparsity performance relative to fixed top-k baselines offers supporting evidence, yet we acknowledge that an explicit ablation would better isolate the adaptive prior's contribution. We will add such an ablation in the revision, comparing DashAttention against variants with fixed block counts or non-adaptive α values. We will also clarify in §3.2 the conditions under which the selected blocks remain predominantly relevant and note that α may be tuned per layer when beneficial. revision: yes
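The concentration statistics discussed in this exchange are cheap to compute from any attention distribution. The sketch below is one possible set of definitions, not the authors' planned analysis: it takes "effective support size" to mean the exponential of the Shannon entropy, which is an assumption, and the function names are hypothetical.

```python
import numpy as np

def attention_entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def effective_support(p):
    """exp(entropy): the number of equally weighted entries with the same entropy."""
    return float(np.exp(attention_entropy(p)))

def support_size(p):
    """Number of entries with strictly positive weight."""
    return int(np.count_nonzero(np.asarray(p) > 0))

sparse_attn = [0.6, 0.4, 0.0, 0.0]
dense_attn = [0.4, 0.3, 0.2, 0.1]
for name, p in (("sparse", sparse_attn), ("dense", dense_attn)):
    print(name, support_size(p), round(effective_support(p), 3))
```

Collecting `support_size` over coarse-stage outputs gives the histogram of selected blocks per query, and `attention_entropy` over fine-stage outputs, grouped by layer, gives the per-layer entropy plot.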

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces DashAttention by replacing top-k selection with α-entmax in the coarse stage to enable variable support and gradient flow, then using the resulting distribution as a prior for fine-grained softmax. This construction is presented as a direct methodological choice with independent motivation from prior hierarchical methods (NSA, InfLLMv2). Performance claims rest on empirical comparisons rather than any equation that reduces a prediction to a fitted parameter or self-citation by definition. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renamings of known results appear in the abstract or described chain. The non-dispersive property is asserted as a consequence of the adaptive sparsity mechanism and is evaluated experimentally, not derived tautologically from the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The α parameter in entmax and any sparsity thresholds are likely tuned but not detailed here.

free parameters (1)
  • alpha in α-entmax
    Controls the degree of sparsity in block selection; value not specified in abstract but central to adaptivity.
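How α controls sparsity can be seen with a small bisection solver for α-entmax. This is a NumPy sketch of the standard bisection approach for α > 1, written for illustration only. It is not the paper's Triton kernel, `entmax_bisection_sketch` is a hypothetical name, and the scores are made up.

```python
import numpy as np

def entmax_bisection_sketch(z, alpha, n_iter=60):
    """alpha-entmax for alpha > 1: p_i = max((alpha-1)*z_i - tau, 0)**(1/(alpha-1))."""
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo = z.max() - 1.0                       # here the sum of p is at least 1
    hi = z.max() - z.size ** (1.0 - alpha)   # here the sum of p is at most 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        total = (np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))).sum()
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.maximum(z - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()                       # remove the residual bisection error

scores = [1.0, 0.8, 0.5, 0.1, -0.5]          # hypothetical coarse block scores
for alpha in (1.25, 1.5, 2.0):
    p = entmax_bisection_sketch(scores, alpha)
    print(alpha, int(np.count_nonzero(p)), np.round(p, 3))
```

On these scores the support shrinks as α grows, which is the sense in which α acts as the free parameter governing how many blocks survive the coarse stage.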


Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 16 internal anchors

  1. [1]

    Is it really long context if all you need is retrieval? towards genuinely difficult long context NLP

    Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, and Reut Tsarfaty. Is it really long context if all you need is retrieval? towards genuinely difficult long context NLP. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16576–16586,...

  2. [2]

    Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, and Edoardo M. Ponti. The sparse frontier: Sparse attention trade-offs in transformer llms. InProceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026

  3. [3]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  4. [4]

    Softmax is not enough (for sharp size generalisation)

    Petar Veliˇckovi´c, Christos Perivolaropoulos, Federico Barbero, and Razvan Pascanu. Softmax is not enough (for sharp size generalisation). InForty-second International Conference on Machine Learning, 2025

  5. [5]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng 10 Liang, and Wangding Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pile...

  6. [6]

    InfLLM-v2: Dense-sparse switchable attention for seamless short-to-long adaptation

    Weilin Zhao, Zihan Zhou, Zhou su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, and Zhiyuan Liu. InfLLM-v2: Dense-sparse switchable attention for seamless short-to-long adaptation. InThe Fourteenth International Conference on Learning Representations, 2026

  7. [7]

    Ben Peters, Vlad Niculae, and André F. T. Martins. Sparse sequence-to-sequence models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1504–1519, Florence, Italy, July 2019. Association for Computational Linguistics

  8. [8]

    Long-context generalization with sparse attention

    Pavlo Vasylenko, Hugo Pitorro, Andre Martins, and Marcos Vinicius Treviso. Long-context generalization with sparse attention. InThe Fourteenth International Conference on Learning Representations, 2026

  9. [9]

    MoBA: Mixture of block attention for long-context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Yutao Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, and Jiezhong Qiu. MoBA: Mixture of block attention for long-contex...

  10. [10]

    Flashattention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Re. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

  11. [11]

    From softmax to sparsemax: A sparse model of attention and multi-label classification

    Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Maria Florina Balcan and Kilian Q. Weinberger, editors, International Conference on Machine Learning (ICML), volume 48 ofProceedings of Machine Learning Research, pages 1614–1623, New York, New York, USA, 20–22 Jun 2016. PMLR

  12. [12]

    Adasplash: Adaptive sparse flash attention

    Nuno Gonçalves, Marcos V Treviso, and Andre Martins. Adasplash: Adaptive sparse flash attention. InForty-second International Conference on Machine Learning, 2025

  13. [13]

    Adasplash-2: Faster differentiable sparse attention

    Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, and Marcos Treviso. Adasplash-2: Faster differentiable sparse attention. InForty-third International Conference on Machine Learning, 2026

  14. [14]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

  15. [15]

    Learning classifiers with fenchel-young losses: Generalized entropies, margins, and algorithms

    Mathieu Blondel, Andre Martins, and Vlad Niculae. Learning classifiers with fenchel-young losses: Generalized entropies, margins, and algorithms. In Kamalika Chaudhuri and Masashi Sugiyama, editors,Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 ofProceedings of Machine Learning Research, page...

  16. [16]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InThe Twelfth International Conference on Learning Representations, 2024

  17. [17]

    Philippe Tillet, H. T. Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, page 10–19, New York, NY , USA, 2019. Association for Computing Machinery. 11

  18. [18]

    Infllm-v2-data-5b dataset

    OpenBMB. Infllm-v2-data-5b dataset. https://huggingface.co/datasets/openbmb/ InfLLM-V2-data-5B, 2025

  19. [19]

    Minicpm4: Ultra-efficient llms on end devices

    MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, et al. Minicpm4: Ultra-efficient llms on end devices. arXiv preprint arXiv:2506.07900, 2025

  20. [20]

    RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

  21. [21]

    HELMET: How to evaluate long-context models effectively and thoroughly

    Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. HELMET: How to evaluate long-context models effectively and thoroughly. InThe Thirteenth International Conference on Learning Representations, 2025

  22. [22]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  23. [23]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  24. [24]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

  25. [25]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  26. [26]

    Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

  27. [27]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  28. [28]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  29. [29]

    Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gard- ner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Paper...

  30. [30]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

  31. [31]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  32. [32]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  33. [33]

    Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37:62557–62583, 2024

  34. [34]

    Flashattention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  35. [35]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

  36. [36]

    Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference

    Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. InFindings of the Association for Computational Linguistics: ACL 2024, pages 3258–3270, 2024

  37. [37]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations, 2024

  38. [38]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

  39. [39]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for longer sequences. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 17...

  40. [40]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  41. [41]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.arXiv preprint arXiv:2407.02490, 2024

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhen- hua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.arXiv preprint arXiv:2407.02490, 2024

  42. [42]

    Reformer: The efficient transformer

    Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020

  43. [43]

    Spargeattention: Accurate and training-free sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

  44. [44]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference.arXiv preprint arXiv:2406.10774, 2024

  45. [45]

    Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms.arXiv preprint arXiv:2310.01801, 2023. 13

  46. [46]

    DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads.arXiv preprint arXiv:2410.10819, 2024

  47. [47]

    Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026

    Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, and Min Zhang. Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026

  48. [48]

    Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  49. [49]

    R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

    Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, et al. R-kv: Redundancy-aware kv cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

  50. [50]

    Indexcache: Accelerating sparse attention via cross-layer index reuse

    Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, and Juanzi Li. Indexcache: Accelerating sparse attention via cross-layer index reuse. arXiv preprint arXiv:2603.12201, 2026

  51. [51]

    Infllm: Training-free long-context extrapolation for llms with an efficient context memory

    Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. Infllm: Training-free long-context extrapolation for llms with an efficient context memory. Advances in Neural Information Processing Systems, 37:119638–119661, 2024

  52. [52]

    Shadowkv: Kv cache in shadows for high-throughput long-context llm inference

    Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. Shadowkv: Kv cache in shadows for high-throughput long-context llm inference. arXiv preprint arXiv:2410.21465, 2024

  53. [53]

    Nosa: Native and offloadable sparse attention

    Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, et al. Nosa: Native and offloadable sparse attention. arXiv preprint arXiv:2510.13602, 2025

  54. [54]

    Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo M. Ponti. Dynamic memory compression: retrofitting llms for accelerated inference. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  55. [55]

    Inference-time hyper-scaling with KV cache compression

    Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, and Edoardo Ponti. Inference-time hyper-scaling with KV cache compression. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  56. [56]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024

  57. [57]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

  58. [58]

    Pqcache: Product quantization-based kvcache for long context llm inference

    Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference. Proceedings of the ACM on Management of Data, 3(3):1–30, 2025

  59. [59]

    MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

    Wenxuan Li, Chengruidong Zhang, Huiqiang Jiang, et al. Mtraining: Distributed dynamic sparse attention for efficient ultra-long context training. arXiv preprint arXiv:2510.18830, 2025

  60. [60]

    Seerattention-r: Sparse attention adaptation for long reasoning

    Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, et al. Seerattention-r: Sparse attention adaptation for long reasoning. arXiv preprint arXiv:2506.08889, 2025

  61. [61]

    Seerattention: Learning intrinsic sparse attention in your llms

    Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden Kwok-Hay So, Ting Cao, Fan Yang, et al. Seerattention: Learning intrinsic sparse attention in your llms. arXiv preprint arXiv:2410.13276, 2024

  62. [62]

    Flash sparse attention: An alternative efficient implementation of native sparse attention kernel

    Ran Yan, Youhe Jiang, and Binhang Yuan. Flash sparse attention: An alternative efficient implementation of native sparse attention kernel. arXiv e-prints, pages arXiv–2508, 2025

  63. [63]

    Hsa: Head-wise sparse attention for efficient and accurate long-context inference

    Jing Liu, Jianqiao Lu, Yao Luo, Yuan Yang, Chen Zheng, Deyi Liu, Mengzhao Chen, Chaoyi Zhang, Yunshui Li, Jin Ma, et al. Hsa: Head-wise sparse attention for efficient and accurate long-context inference

  64. [64]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025

  65. [65]

    Deepseek-v4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026. Technical report, Hugging Face repository, accessed 2026-04-28

  66. [66]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

  67. [67]

    Spargeattention2: Trainable sparse attention via hybrid top-k+ top-p masking and distillation fine-tuning

    Jintao Zhang, Kai Jiang, Chendong Xiang, Weiqi Feng, Yuezhou Hu, Haocheng Xi, Jianfei Chen, and Jun Zhu. Spargeattention2: Trainable sparse attention via hybrid top-k+ top-p masking and distillation fine-tuning. arXiv preprint arXiv:2602.13515, 2026

  68. [68]

    Double-p: Hierarchical top-p sparse attention for long-context llms

    Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, and Jishen Zhao. Double-p: Hierarchical top-p sparse attention for long-context llms. arXiv preprint arXiv:2602.05191, 2026

  69. [69]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models

    Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models. arXiv preprint arXiv:2504.03624, 2025

  70. [70]

    Possible generalization of boltzmann-gibbs statistics

    Constantino Tsallis. Possible generalization of boltzmann-gibbs statistics. Journal of Statistical Physics, 52(1):479–487, 1988

  71. [71]

    MiniCPM: Unveiling the potential of small language models with scalable training strategies

    Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the potential of small language models with...

  72. [72]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  73. [73]

    Olmes: A standard for language model evaluations

    Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hannaneh Hajishirzi. Olmes: A standard for language model evaluations. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 5005–5033, 2025

  74. [74]

    The language model evaluation harness

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  75. [75]

    Hardware-aligned hierarchical sparse attention for efficient long-term memory access

    Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, and Wei Wu. Hardware-aligned hierarchical sparse attention for efficient long-term memory access. arXiv preprint arXiv:2504.16795, 2025

  76. [76]

    Every token counts: Generalizing 16m ultra-long context in large language models

    Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, and Jianguo Li. Every token counts: Generalizing 16m ultra-long context in large language models. arXiv preprint arXiv:2511.23319, 2025

    [Figure: panels titled "Softmax", "1.5-entmax", "Top-1 softmax", ...; axes z1 and z2]

    Softmax head aggregation is dispersive, i.e.,

    $$\lim_{n\to\infty} \frac{H\big(\mathrm{aggr}_{\mathrm{softmax}}(z^{(1)}, z^{(2)}, \dots, z^{(H)}; \theta)\big)}{\log n} = 1.$$
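
    The dispersion limit can be checked numerically in a simplified setting. The sketch below is an illustration only, not the paper's aggregation operator: it assumes a single head, uses hypothetical evenly spaced logits in [-1, 1] (the helper `bounded_logits` is invented for this example), and substitutes sparsemax, i.e. α-entmax with α = 2, for a general α-entmax. It prints the entropy divided by log n for both transformations, together with the sparsemax support size.

```python
import math

def softmax(z):
    """Dense softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2) via a sort-based threshold search."""
    cumsum, tau = 0.0, 0.0
    for k, v in enumerate(sorted(z, reverse=True), start=1):
        cumsum += v
        if 1.0 + k * v > cumsum:
            tau = (cumsum - 1.0) / k  # threshold for a support of size k
        else:
            break
    return [max(v - tau, 0.0) for v in z]

def normalized_entropy(p):
    """Shannon entropy of p divided by log(len(p))."""
    h = -sum(v * math.log(v) for v in p if v > 0.0)
    return h / math.log(len(p))

def bounded_logits(n):
    """Hypothetical input (an assumption): n evenly spaced logits in [-1, 1]."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

for n in (64, 1024, 16384):
    z = bounded_logits(n)
    dense, sparse = softmax(z), sparsemax(z)
    support = sum(1 for v in sparse if v > 0.0)
    print(n, round(normalized_entropy(dense), 3),
          round(normalized_entropy(sparse), 3), support)
```

    With logits confined to a fixed interval, the softmax ratio should drift toward 1 as n grows. For this particular input the sparsemax support grows roughly like the square root of n, so its ratio stays near or below one half, which is consistent with the paper's claim that a support of size O(n^β) with β < 1 prevents dispersion.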

    Denote $p^{(h)} = \alpha\text{-entmax}(z^{(h)})$. If $\|p^{(h)}\|_0 = O(n^{\beta_h})$ with $\beta_h \in (0,1)$, then entmax head aggregation is not dispersive, and

    $$\limsup_{n\to\infty} \frac{H\big(\mathrm{aggr}_{\alpha\text{-entmax}}(z^{(1)}, z^{(2)}, \dots, z^{(H)}; \theta)\big)}{\log n} \le \max_{h\in[H]} \beta_h < 1.$$

    Proof. We first prove that softmax head aggregation is dispersive. Denote $p^{(h)} = \mathrm{softmax}(z^{(h)})$. First, we find the lower bound of the numerator by using $H(\cdot)$'s ...