Express Language Modeling

Albert Gong; Annabelle Michael Carrell; Lester Mackey; Raaz Dwivedi

arxiv: 2606.10944 · v1 · pith:EW3FIAUKnew · submitted 2026-06-09 · 💻 cs.LG · cs.DS · math.ST · stat.ME · stat.ML · stat.TH


Pith reviewed 2026-06-27 13:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.DS · math.ST · stat.ME · stat.ML · stat.TH

keywords Express · causal attention · attention approximation · Thinformer · language modeling · long-context · KV cache compression · Triton implementation


The pith

Express converts non-causal attention approximations into causal ones while preserving their error guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Express, a conversion procedure that takes any non-causal attention approximation and produces a causal version with matching approximation guarantees. Causal attention is required for standard autoregressive language modeling, yet many strong approximation techniques were previously limited to non-causal settings. By matching the guarantees, Express makes the best non-causal methods available for causal use. When paired with the Thinformer approximation, it achieves $\log^{3/2}(n)/s$ approximation error for a sequence of length $n$, using only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead. The authors also supply an I/O-aware Triton implementation that yields speedups over FlashAttention 2 and overcomes four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.
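To make the causal versus non-causal distinction concrete, here is a minimal pure-Python sketch of exact softmax attention with an optional causal mask. It is not Express or Thinformer; the function name `attention` and the conventional `1/sqrt(d)` score scaling are assumptions chosen for illustration.

```python
import math

def attention(Q, K, V, causal):
    """Exact softmax attention on lists of row vectors (illustrative only).

    With causal=True, query i attends only to keys 0..i;
    with causal=False, every query attends to every key.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        last = i + 1 if causal else len(K)
        # Scaled dot-product scores against the visible keys.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(last)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(x - m) for x in scores]
        total = sum(weights)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) / total
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

causal_out = attention(Q, K, V, causal=True)
full_out = attention(Q, K, V, causal=False)
print("causal row 0:    ", causal_out[0])  # sees only key 0
print("non-causal row 0:", full_out[0])    # sees all keys
```

In the causal case the first query can only see the first key, so its output equals the first value row; the last query sees every key in both modes, so the two outputs agree on the final row.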

Core claim

Express is a conversion tool that takes a non-causal attention approximation and produces a causal version with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. The method is realized in an efficient I/O-aware Triton implementation that demonstrates substantial speedups over FlashAttention 2 and overcomes four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.

What carries the argument

Express, the conversion procedure from non-causal to causal attention approximation that preserves the original approximation guarantees exactly.

If this is right

Causal attention can now achieve the same approximation quality previously available only in non-causal settings.
The error bound improves to $\log^{3/2}(n)/s$ with $O(s)$ memory.
Compression overhead remains $O(s^2 \log^2(n))$ for sequence length $n$.
An efficient Triton implementation produces speedups relative to FlashAttention 2.
Four concrete bottlenecks in long-context language modeling are removed: prefill, KV cache compression, memory-constrained decoding, and compute-constrained decoding.
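The quoted rates can be tabulated to see how they trade off. The sketch below sets every hidden constant to 1 and uses the natural logarithm, both of which are assumptions (the page does not state constants or the log base), so only the ratios between rows are meaningful. The helper name `quoted_rates` is hypothetical.

```python
import math

def quoted_rates(n: int, s: int) -> dict:
    """Evaluate the quoted rates with all constants set to 1 (an assumption)."""
    log_n = math.log(n)
    return {
        "error": log_n ** 1.5 / s,        # log^{3/2}(n) / s
        "memory": s,                      # O(s)
        "overhead": s ** 2 * log_n ** 2,  # O(s^2 log^2(n))
    }

n = 2 ** 17  # example sequence length, chosen arbitrarily
for s in (64, 128, 256):
    r = quoted_rates(n, s)
    print(f"s={s:4d}  error~{r['error']:.4f}  memory~{r['memory']}  "
          f"overhead~{r['overhead']:.3e}")
```

Under these rates, doubling $s$ halves the error term and doubles memory while quadrupling the compression overhead, whereas the dependence on $n$ is only polylogarithmic.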

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future non-causal approximation improvements can transfer directly to causal language models via the same conversion.
Separate lines of research on causal-only approximations may become less necessary.
The memory and overhead scaling could support longer context windows under fixed hardware budgets.

Load-bearing premise

The conversion procedure from non-causal to causal approximation preserves the original approximation guarantees exactly, without requiring additional assumptions on the attention matrix or sequence statistics.

What would settle it

An explicit attention matrix and sequence where the causal approximation produced by Express exhibits strictly larger error than the original non-causal approximation.
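A check of this kind could be organized as a small harness that measures the error of a non-causal approximation and of its causal counterpart against exact attention on the same inputs. The sketch below is only a scaffold: the "approximation" is a toy strided key subsample standing in for a real method, it is neither Thinformer nor Express, and the max entrywise error metric is an assumption, since the page does not state which norm the paper's guarantee uses. All function names are hypothetical.

```python
import math
import random

def masked_attention(Q, K, V, allowed):
    """Exact softmax attention where query i uses only key indices allowed(i)."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        idx = list(allowed(i))
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in idx]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(x - m) for x in scores]
        z = sum(w)
        out.append([sum(wj * V[j][c] for wj, j in zip(w, idx)) / z
                    for c in range(len(V[0]))])
    return out

def max_abs_error(A, B):
    """Largest entrywise absolute difference (assumed error metric)."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def compare_errors(Q, K, V, stride):
    """Return (non-causal error, causal error) for a toy strided key subset."""
    n = len(K)
    kept = range(0, n, stride)  # toy stand-in for a learned or thinned coreset
    exact_nc = masked_attention(Q, K, V, lambda i: range(n))
    exact_c = masked_attention(Q, K, V, lambda i: range(i + 1))
    approx_nc = masked_attention(Q, K, V, lambda i: kept)
    approx_c = masked_attention(Q, K, V, lambda i: [j for j in kept if j <= i])
    return max_abs_error(exact_nc, approx_nc), max_abs_error(exact_c, approx_c)

random.seed(0)
n, d = 32, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

err_nc, err_c = compare_errors(Q, K, V, stride=4)
print(f"non-causal error: {err_nc:.4f}  causal error: {err_c:.4f}")
```

To test the paper's claim, the two toy approximations would be replaced by the actual non-causal method and its Express conversion, and the comparison would need to use the error measure and assumptions under which the paper states its guarantee.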

Figures

Figures reproduced from arXiv: 2606.10944 by Albert Gong, Annabelle Michael Carrell, Lester Mackey, Raaz Dwivedi.

**Figure 1.** Accelerating long-context prefill. For masked attention, Thinformer Express attains larger speed-ups over FlashAttention 2 than the Triton HyperAttention algorithm (b) while improving the runtime-perplexity trade-off on LongBench-E tasks (c). (a) For unmasked attention, our I/O-aware Triton implementation also provides speed-ups over the original Torch-compiled Thinformer. Open-source code at https://githu…

**Figure 2.** Accelerating KV cache compression. Across leading KV cache compression methods and LongBench-E tasks, Thinformer Express substantially reduces runtime while preserving quality. Error bands display ±1 standard error across all benchmark questions. See Sec. 4 for more details.

**Figure 3.** Memory- and compute-efficient long-form decoding. On competition-level MATH-500 problems that benefit from step-by-step reasoning, Thinformer Express improves both the cache size-accuracy trade-off (left) and time-accuracy trade-off (right) of long-form decoding. Error bands display ±1 standard error across all 500 benchmark questions. See Sec. 4 for more details.

Original abstract

We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering $\log^{3/2}(n)/s$ approximation error with only $O(s)$ memory and $O(s^2 \log^2(n))$ compression overhead for a sequence of length $n$. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Express claims an exact conversion from non-causal to causal attention approximations that preserves error bounds, but that step is the part that needs the closest check.


The paper introduces Express as a conversion tool that takes a non-causal attention approximation and produces a causal version with the same guarantees. When applied to Thinformer it reports log^{3/2}(n)/s error, O(s) memory, and O(s^2 log^2(n)) overhead for causal attention on length-n sequences. It also supplies a Triton implementation and shows speedups over FlashAttention 2 while targeting four concrete bottlenecks: long-context prefill, KV cache compression, memory-limited decoding, and compute-limited decoding.

The practical framing is the strongest part. The authors identify real resource limits in current pipelines and tie the new bounds directly to those limits. The implementation detail and reported speedups give something concrete to test even before the theory is fully digested.

The soft spot is the conversion itself. The abstract states that Express maps the non-causal construction to a causal one without changing the approximation error, memory, or overhead. That claim is load-bearing. If the conversion requires extra assumptions on the attention matrix, introduces a lower-order error term, or changes the compression cost in practice, the stated improvement over prior causal bounds does not follow directly. The paper will stand or fall on whether the construction and proof make this preservation explicit and assumption-free.

This work is aimed at researchers and engineers who build or optimize long-context transformers and care about attention approximations. Readers who already follow Thinformer-style results will find the most immediate value.

It deserves a serious referee to examine the conversion math and the experimental verification of the claimed bounds.

Referee Report

1 major / 0 minor

Summary. The paper introduces Express, a conversion tool that transforms non-causal attention approximations into causal ones while preserving the original approximation guarantees. When paired with the Thinformer approximation, Express yields a causal attention guarantee of log^{3/2}(n)/s error using O(s) memory and O(s^2 log^2(n)) compression overhead for sequence length n. The work also provides an I/O-aware Triton implementation demonstrating speedups over FlashAttention 2 and applies the method to address bottlenecks in long-context prefill, KV cache compression, and memory/compute-constrained decoding.

Significance. If the conversion procedure indeed maps non-causal guarantees to causal attention without additional error terms or assumptions, the result would strengthen the best-known theoretical bounds for causal attention and enable more efficient long-context language modeling. The combination of improved bounds, low memory overhead, and a practical implementation would be a notable contribution to efficient transformer research.

major comments (1)

[Abstract] The central claim that Express converts a non-causal approximation (Thinformer) into a causal one while delivering exactly the same log^{3/2}(n)/s error bound (with no extra terms) is load-bearing for the stated improvement over prior causal attention results. The skeptical concern that the conversion step may introduce lower-order error or require additional matrix assumptions (e.g., decay properties away from the diagonal) is not resolved by the abstract alone; without the explicit construction and error analysis, the bound does not follow directly from the non-causal case.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review. We address the single major comment below.


Referee: [Abstract] The central claim that Express converts a non-causal approximation (Thinformer) into a causal one while delivering exactly the same log^{3/2}(n)/s error bound (with no extra terms) is load-bearing for the stated improvement over prior causal attention results. The skeptical concern that the conversion step may introduce lower-order error or require additional matrix assumptions (e.g., decay properties away from the diagonal) is not resolved by the abstract alone; without the explicit construction and error analysis, the bound does not follow directly from the non-causal case.

Authors: The manuscript provides the explicit construction of Express together with the full error analysis in Sections 3 and 4. The conversion maps any non-causal approximation to a causal one while preserving the original error bound exactly, without introducing lower-order terms. The argument relies only on the standard assumptions already used for the non-causal Thinformer guarantee and does not require additional decay properties away from the diagonal. Consequently, the stated causal bound follows directly from the non-causal result via the given mapping.

Revision: no

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained


The paper introduces Express as a new conversion procedure that maps non-causal approximations (such as Thinformer) to causal ones while preserving the original error bounds exactly. No equations, fitted parameters, or derivation steps are exhibited in the provided text that reduce the claimed log^{3/2}(n)/s guarantee to a self-definition, a renamed fit, or a load-bearing self-citation chain. Thinformer is referenced as an external state-of-the-art result, and the conversion is presented as a novel construction whose guarantees follow from its definition rather than tautologically equaling its inputs. The central claim therefore remains independent of the patterns that would trigger circularity scores above 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; all arrays left empty.

pith-pipeline@v0.9.1-grok · 5667 in / 1046 out tokens · 22561 ms · 2026-06-27T13:43:26.572004+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 5 canonical work pages · 3 internal anchors

[1] Ryan Alweiss, Yang P Liu, and Mehtaab Sawhney. Discrepancy minimization via a self-balancing walk. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 14–20, 2021.

[2] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages ..., 2024.

[3] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. In Second Conference on Language Modeling, 2025.

[4] Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Low-rank thinning. In International Conference on Machine Learning, pages 6811–6848. PMLR, 2025.

[5] Bernard Chazelle and Jiri Matousek. On linear-time deterministic algorithms for optimization problems in fixed dimension. Journal of Algorithms, 21(3):579–597, 1996.

[6] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec.

[7] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

[8] Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. A simple and effective l_2 norm-based strategy for kv cache compression. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18476–18499, 2024.

[9] Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: Kv cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025.

[10] Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. In International Conference on Learning Representations, 2022.
[11] Raaz Dwivedi and Lester Mackey. Kernel thinning. Journal of Machine Learning Research, 25(152):1–77, 2024.

[12] Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024.

[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.

[15] Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Eh0Od2BJIM.

[16] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.

[17] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 01621459, 1537274X. URL http://www.jstor.org/stable/2282952.

[18] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017.

[19] Ekaterina Kochetkova, Kshiteej Sheth, Insu Han, Amir Zandieh, and Michael Kapralov. Streaming attention approximation via discrepancy theory. In Advances in Neural Information Processing Systems, 2025.

[20] Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.
[21] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. In Advances in Neural Information Processing Systems, volume 37, 2024.

[22] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.

[23] Jiri Matousek. Approximations and optimal geometric divide-and-conquer. Journal of Computer and System Sciences, 50(2):203–208, 1995.

[24] Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Christopher Lott. Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[25] Jeff M Phillips. Algorithms for ε-approximations of terrains. In International Colloquium on Automata, Languages, and Programming, pages 447–458. Springer, 2008.

[26] Tobias Schröder and Lester Mackey. Wildcat: Near-linear attention in theory and practice. arXiv preprint arXiv:2602.10056, 2026.

[27] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024.

[28] Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022.

[29] Ingo Steinwart and Andreas Christmann. Support vector machines. Wiley Interdisciplinary Reviews: Computational Statistics, 1, 2008. URL https://api.semanticscholar.org/CorpusID:661123.

[30] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.

[32] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations. URL https://openreview.net/forum?id=NG7sS51zVF.

[34] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018.

[35] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. In International Conference on Machine Learning, pages 40605–40623. PMLR, 2023.

[1] [1]

Discrepancy minimization via a self-balancing walk

Ryan Alweiss, Yang P Liu, and Mehtaab Sawhney. Discrepancy minimization via a self-balancing walk. InProceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 14–20, 2021.(Cited on page 6.)

2021

[2] [2]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages ...

2024

[3] [3]

Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025.(Cited on pages 2 and 8.)

2025

[4] [4]

Low-rank thinning

Annabelle Michael Carrell, Albert Gong, Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Low-rank thinning. InInternational Conference on Machine Learning, pages 6811–6848. PMLR, 2025.(Cited on pages 1, 2, 5, 6, 7, 13, 16, and 17.)

2025

[5] [5]

On linear-time deterministic algorithms for optimization problems in fixed dimension.Journal of Algorithms, 21(3):579–597, 1996.(Cited on page 6.)

Bernard Chazelle and Jiri Matousek. On linear-time deterministic algorithms for optimization problems in fixed dimension.Journal of Algorithms, 21(3):579–597, 1996.(Cited on page 6.)

1996

[6] [6]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=mZn2Xyh9Ec.(Cited on pages 1, 7, and 10.)

2024

[7] [7]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022.(Cited on pages 1 and 7.)

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022.(Cited on pages 1 and 7.)

2022

[8] [8]

A simple and effective l_2 norm-based strategy for kv cache compression

Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. A simple and effective l_2 norm-based strategy for kv cache compression. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18476–18499, 2024.(Cited on page 9.)

2024

[9] [9]

Expected attention: Kv cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636, 2025.(Cited on pages 9 and 21.)

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: Kv cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636, 2025.(Cited on pages 9 and 21.)

work page arXiv 2025

[10] [10]

Generalized kernel thinning

Raaz Dwivedi and Lester Mackey. Generalized kernel thinning. InInternational Conference on Learning Representations, 2022.(Cited on pages 2 and 5.) 10

2022

[11] [11]

Kernel thinning.Journal of Machine Learning Research, 25 (152):1–77, 2024.(Cited on pages 2, 5, and 15.)

Raaz Dwivedi and Lester Mackey. Kernel thinning.Journal of Machine Learning Research, 25 (152):1–77, 2024.(Cited on pages 2, 5, and 15.)

2024

[12] [12]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team Glm, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024.(Cited on page 8.)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024.(Cited on page 8.)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025.(Cited on pages 1 and 9.)

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025.(Cited on pages 1 and 9.)

2025

[15] [15]

Hyperattention: Long-context attention in near-linear time

Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David Woodruff, and Amir Zandieh. Hyperattention: Long-context attention in near-linear time. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum? id=Eh0Od2BJIM.(Cited on pages 1, 6, and 8.)

2024

[16] [16]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.(Cited on pages ...

2021

[17] [17]

Hoeffding

Wassily Hoeffding. Probability inequalities for sums of bounded random variables.Journal of the American Statistical Association, 58(301):13–30, 1963. ISSN 01621459, 1537274X. URL http://www.jstor.org/stable/2282952.(Cited on page 15.)

work page arXiv 1963

[18] [18]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017.(Cited on page 8.)

2017

[19] [19]

Streaming attention approximation via discrepancy theory

Ekaterina Kochetkova, Kshiteej Sheth, Insu Han, Amir Zandieh, and Michael Kapralov. Streaming attention approximation via discrepancy theory. InAdvances in Neural Information Processing Systems, 2025.(Cited on pages 1 and 6.)

2025

[20] [20]

Learning question classifiers

Xin Li and Dan Roth. Learning question classifiers. InCOLING 2002: The 19th International Conference on Computational Linguistics, 2002.(Cited on page 8.)

2002

[21] [21]

Snapkv: Llm knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. InAdvances in Neural Information Processing Systems, volume 37, 2024. (Cited on pages 2, 8, and 9.)

2024

[22] [22]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023.(Cited on pages 2, 9, and 21.)

2023

[23]

Approximations and optimal geometric divide-and-conquer

Jiri Matousek. Approximations and optimal geometric divide-and-conquer. Journal of Computer and System Sciences, 50(2):203–208, 1995. (Cited on page 6.)

1995

[24]

Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments

Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Christopher Lott. Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. (Cited on page 9.)

2025

[25]

Algorithms for ε-approximations of terrains

Jeff M Phillips. Algorithms for ε-approximations of terrains. In International Colloquium on Automata, Languages, and Programming, pages 447–458. Springer, 2008. (Cited on page 6.)

2008

[26]

WildCat: Near-Linear Attention in Theory and Practice

Tobias Schröder and Lester Mackey. Wildcat: Near-linear attention in theory and practice. arXiv preprint arXiv:2602.10056, 2026. (Cited on pages 1 and 6.)

2026

[27]

Flashattention-3: Fast and accurate attention with asynchrony and low-precision

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems, 37:68658–68685, 2024. (Cited on page 10.)

2024

[28]

Distribution compression in near-linear time

Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. In International Conference on Learning Representations, 2022. (Cited on pages 3 and 14.)

2022

[29]

Support vector machines

Ingo Steinwart and Andreas Christmann. Support vector machines. Wiley Interdisciplinary Reviews: Computational Statistics, 1, 2008. URL https://api.semanticscholar.org/CorpusID:661123. (Cited on pages 2 and 15.)

2008

[30]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019. (Cited on pages 1 and 7.)

2019

[31]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. (Cited on pages 1,...

2017

[32]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF. (Cited on pages 2, 8, and 9.)

[34]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018. (Cited on page 8.)

2018

[35]

Kdeformer: Accelerating transformers via kernel density estimation

Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. In International Conference on Machine Learning, pages 40605–40623. PMLR, 2023. (Cited on page 1.)

2023

A Kernel Halving

For completeness, we reproduce the KH(δ) halving algorithm of Carrell et al. [4, Alg. B.1].

Algorithm A.1: KH(δ): Kern...