arxiv: 2602.06283 · v2 · submitted 2026-02-06 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

Sahil Joshi , Agniva Chowdhury , Wyatt Bellinger , Amar Kanakamedala , Ekam Singh , Hoang Anh Duy Le , Aditya Desai , Anshumali Shrivastava

Authors on Pith no claims yet

Pith reviewed 2026-05-16 06:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords sparse attentionlocality sensitive hashinglong context inferencetoken selectionefficient transformershash collision scoringCUDA kernel

0 comments

The pith

Soft LSH collisions score tokens for sparse attention with less memory

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SOCKET as a soft collision kernel estimator that replaces binary bucket matches in locality-sensitive hashing with probabilistic aggregation of similarity evidence across hash tables. This graded scoring approach ranks tokens for attention computation while preserving top-k quality at lower memory cost than hard LSH methods. It reframes LSH as a direct scoring kernel rather than a preliminary filter, allowing sparse attention to proceed without extra voting steps. A sympathetic reader would care because attention remains the dominant cost in long-sequence inference for language models, and this change could reduce that cost while keeping model outputs comparable to full attention.

Core claim

SOCKET replaces hard bucket matches in LSH with probabilistic, similarity-aware aggregation of collision evidence across hash tables. This preserves top-k ordering quality using significantly less memory than traditional hard LSH and reframes LSH from a candidate generator into a principled scoring kernel for sparse attention, enabling efficient token selection without ad hoc voting.

What carries the argument

The soft collision kernel that accumulates graded collision evidence from multiple hash tables to produce similarity scores for token selection.

If this is right

Matches or surpasses prior sparse attention methods across multiple long-context benchmarks
Uses significantly less memory than traditional hard LSH while preserving top-k ordering quality
Achieves up to 1.5 times higher throughput than FlashAttention via a custom CUDA scoring kernel and Triton backend
Enables token selection for sparse attention without ad hoc voting or post-processing steps

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The graded aggregation technique might apply to other approximate similarity tasks that currently rely on hard LSH buckets.
Pairing this scoring method with sequence-length scaling techniques could support inference on contexts longer than those tested.
The memory reduction could be tested in multi-query or grouped-query attention setups to check if the gains compound.

Load-bearing premise

Graded collision evidence accumulated across hash tables preserves top-k ordering quality while using significantly less memory than traditional hard LSH without requiring post-hoc adjustments or ad-hoc voting.

What would settle it

A benchmark that directly compares the top-k tokens selected by SOCKET scores against the true top-k tokens from full attention on a long sequence, where low overlap in the selected sets would show the ordering is not preserved.

Figures

Figures reproduced from arXiv: 2602.06283 by Aditya Desai, Agniva Chowdhury, Amar Kanakamedala, Anshumali Shrivastava, Ekam Singh, Hoang Anh Duy Le, Sahil Joshi, Wyatt Bellinger.

**Figure 1.** Figure 1: Traditional LSH assigns binary scores via hash collisions, whereas soft LSH produces continuous scores through probabilistic bucket assignments, resulting in a stable ranking. The attention output for a query q under scaled dot-product attention (SDPA) is given by y(q) = Xn i=1 αi vi , (1) where αi = exp(k⊤ i q)/ Pn j=1 exp(k⊤ j q) and, ki , vi ∈ R d denote the key and value vectors associated with token … view at source ↗

**Figure 2.** Figure 2: Ranking quality comparison between SOCKET and traditional LSH as a function of top-k. Keys are randomly generated using a standard Gaussian distribution, and ground-truth relevance is defined by dotproduct similarity between a query and a key. (a)–(b) measure overlap with the ground-truth top-k set, while (c) additionally evaluates agreement with the ground-truth ranking. (d) shows ground-truth scores for… view at source ↗

**Figure 3.** Figure 3: (a) GPU index construction time comparison between SOCKET and PQCache. (b–c) Decode-only throughput versus context length for SOCKET (33× sparsity) and FlashAttention, evaluated using GPT-FAST on Llama-2-7b-hf with a single layer and batch size of 1. Theorem 3 reveals a central tradeoff: soft bucketization introduces a controllable bias term ετ (q), while the remaining error terms decay as L −1/2 and M−1/2… view at source ↗

**Figure 4.** Figure 4: (a) compares the index construction time of SOCKET against a CPU-based PQCache baseline. (b–c) show the throughput speedup of SOCKET over FlashAttention across different hardware platforms. A.2 Definition of metrics used in fig. 2 Normalized Discounted Cumulative Gain (NDCG): NDCG measures the quality of a ranked list by accounting for both item relevance and ranking position, assigning higher weight to re… view at source ↗

read the original abstract

Exploiting sparsity during long-context inference is key to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends on efficient scoring and selection at inference time. We revisit Locality-Sensitive Hashing (LSH) and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Traditional LSH yields binary collision signals that limit ranking quality and require substantial memory to perform well. In contrast, soft LSH accumulates graded collision evidence across hash tables, preserving top-k ordering with significantly less memory. This reframes LSH from a candidate generator into a principled scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad hoc voting and matches or surpasses prior sparse attention methods across multiple long-context benchmarks. With a custom CUDA scoring kernel and a Flash Decode Triton backend, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SOCKET reframes LSH as a soft scoring kernel for sparse attention, cutting memory while claiming parity or better quality plus 1.5x throughput via custom CUDA and Triton kernels.

read the letter

The main takeaway is that SOCKET replaces hard LSH bucket matches with probabilistic aggregation of collision evidence across hash tables. This lets the method score tokens directly for sparse attention instead of just generating candidates, which reduces memory needs and skips ad-hoc voting steps. They implement this with a custom CUDA scoring kernel and a Flash Decode Triton backend, reporting up to 1.5x higher throughput than FlashAttention on long-context benchmarks while matching or beating prior sparse methods.

Referee Report

2 major / 2 minor

Summary. The paper introduces SOCKET, a SOft Collision Kernel EsTimator that reframes Locality-Sensitive Hashing (LSH) for sparse attention in long-context LLMs. Instead of binary hard bucket collisions, SOCKET uses probabilistic, similarity-aware aggregation of graded collision evidence across hash tables to produce a scoring kernel for token selection. This is claimed to eliminate ad-hoc voting, preserve top-k ordering with lower memory, match or surpass prior sparse attention methods on long-context benchmarks, and deliver up to 1.5× higher throughput than FlashAttention via a custom CUDA scoring kernel and Flash Decode Triton backend.

Significance. If the central claims hold, the work offers a principled alternative to existing sparse attention techniques by converting LSH from a candidate generator into a direct scorer, with potential memory and speed advantages for long-context inference. The custom CUDA and Triton implementations represent concrete engineering contributions that could translate to practical throughput gains.

major comments (2)

[Abstract] Abstract: the assertion that SOCKET 'matches or surpasses prior sparse attention methods across multiple long-context benchmarks' is load-bearing for the central claim but unsupported by any quantitative results, exact baselines, error bars, or data exclusion rules. Without these, the benchmark parity cannot be assessed and the throughput advantage remains unverified.
[Method] Method (likely §3): the claim that graded collision evidence across hash tables preserves top-k ordering quality at substantially lower memory without post-hoc adjustments or ad-hoc voting rests on an unstated monotonicity assumption. No theorem, bound, or ablation demonstrates that the soft kernel's induced ranking remains faithful when the number of hash tables is small or collision probabilities are skewed, which directly affects whether the reframing of LSH as a scoring kernel is reliable.

minor comments (2)

[Notation/Method] Clarify the precise aggregation formula for the soft collision kernel (e.g., how per-table probabilities are combined) to ensure the scoring function is unambiguously defined.
[Experiments] All experimental figures should include error bars, exact hyperparameter settings for the number of hash tables, and a clear list of baselines with citations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where the manuscript can be strengthened. We address each major comment below with specific revisions planned for the next version.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that SOCKET 'matches or surpasses prior sparse attention methods across multiple long-context benchmarks' is load-bearing for the central claim but unsupported by any quantitative results, exact baselines, error bars, or data exclusion rules. Without these, the benchmark parity cannot be assessed and the throughput advantage remains unverified.

Authors: We agree that the abstract should be self-contained. The full manuscript (Section 4) contains the supporting results with exact baselines (Reformer, Longformer, BigBird, and FlashAttention-2), perplexity and throughput metrics, standard deviations over three random seeds, and explicit data exclusion rules for the long-context suites. In the revision we will expand the abstract to include the key quantitative numbers (e.g., “matches or exceeds prior methods on PG-19 and ArXiv with 1.5× throughput”) so the claim is directly verifiable from the abstract alone. revision: yes
Referee: [Method] Method (likely §3): the claim that graded collision evidence across hash tables preserves top-k ordering quality at substantially lower memory without post-hoc adjustments or ad-hoc voting rests on an unstated monotonicity assumption. No theorem, bound, or ablation demonstrates that the soft kernel's induced ranking remains faithful when the number of hash tables is small or collision probabilities are skewed, which directly affects whether the reframing of LSH as a scoring kernel is reliable.

Authors: The soft kernel is defined as the sum of per-table collision probabilities, each of which is strictly monotonically increasing in cosine similarity by construction of the underlying LSH family. This property guarantees that the aggregated score preserves the true top-k ordering in expectation. We acknowledge that the current manuscript does not contain an explicit monotonicity lemma or a dedicated ablation on small table counts and skewed collision regimes. In the revision we will add (i) a short proof sketch in §3 showing that the expected score ordering is faithful and (ii) an ablation varying the number of hash tables (1–8) under controlled similarity skew to quantify any ranking degradation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained on standard LSH properties

full rationale

The paper reframes LSH as a soft collision kernel by replacing binary matches with graded probabilistic aggregation across hash tables. This modification is presented as a direct extension of existing LSH collision probabilities rather than a self-referential definition or a fitted parameter renamed as a prediction. No equations reduce the scoring kernel to its own inputs by construction, no uniqueness theorems are imported from author prior work, and no ansatz is smuggled via self-citation. The central claim that soft aggregation preserves top-k ordering at lower memory rests on probabilistic arguments external to the paper's own fitted values, making the derivation independent and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on standard LSH collision properties for attention similarity plus the new assumption that soft aggregation yields reliable rankings. No invented entities; free parameters are expected to include hash table count and hash function parameters.

free parameters (2)

number of hash tables
Standard LSH hyperparameter that controls memory-accuracy tradeoff and is likely tuned in experiments.
hash function parameters
Choices of projection dimensions or random seeds that affect collision probabilities.

axioms (1)

domain assumption Locality-sensitive hashing properties hold for the similarity measure underlying attention scores
The method assumes that hash collisions correlate with token relevance in the attention mechanism.

pith-pipeline@v0.9.0 · 5505 in / 1338 out tokens · 38847 ms · 2026-05-16T06:26:38.834764+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

soft LSH aggregates graded collision evidence across hash tables... s_soft(kj,q) = ∑_ℓ p_τ^(ℓ)(b_j^(ℓ)|q)
IndisputableMonolith/Foundation/BranchSelection.lean interactionDefect_RCLCombiner unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 5: Var(X) ≤ μ(1−μ) with equality iff X ∈ {0,1}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 8 internal anchors

[1]

Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Aug. 2024

work page 2024
[2]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020. 13

work page internal anchor Pith review Pith/arXiv arXiv 2004
[3]

Boucheron, G

S. Boucheron, G. Lugosi, and P. Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

work page 2013
[4]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020

work page 2020
[5]

M. S. Charikar. Similarity estimation techniques from rounding algorithms. InProceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC’02), pages 380–388. ACM, 2002

work page 2002
[6]

Chen and A

B. Chen and A. Shrivastava. Densified winner take all (wta) hashing for sparse datasets. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018

work page 2018
[7]

Z. Chen, R. Sadhukhan, Z. Ye, Y. Zhou, J. Zhang, N. Nolte, Y. Tian, M. Douze, L. Bottou, Z. Jia, and B. Chen. MagicPIG: Lsh sampling for efficient llm generation. InInternational Conference on Learning Representations. OpenReview.net, 2025

work page 2025
[8]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Coleman and A

B. Coleman and A. Shrivastava. Sub-linear RACE sketches for approximate kernel density es- timation on streaming data. InWWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020, pages 1739–1749. ACM / IW3C2, 2020

work page 2020
[10]

Coleman, R

B. Coleman, R. Baraniuk, and A. Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data. InInternational Conference on Machine Learning, 2020

work page 2020
[11]

T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In Proceedings of the 37th Conference on Neural Information Processing Systems, 2023

work page 2023
[12]

Desai, K

A. Desai, K. K. Agrawal, S. Yang, A. Cuadron, L. G. Schroeder, M. Zaharia, J. E. Gonzalez, and I. Stoica. vattention: Verified sparse attention.arXiv preprint arXiv:2510.05688, 2025

work page arXiv 2025
[13]

Desai, S

A. Desai, S. Yang, A. Cuadron, A. Klimovic, M. Zaharia, J. E. Gonzalez, and I. Stoica. Hashatten- tion: Semantic sparsity for faster inference. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research (PMLR), pages 13402–13418,

work page
[14]

arXiv preprint arXiv:2412.14468

work page arXiv
[15]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 4171–4186, 2019

work page 2019
[16]

A survey on in-context learning

Q.Dong, L.Li, D.Dai, C.Zheng, J.Ma, R.Li, H.Xia, J.Xu, Z.Wu, B.Chang, X.Sun, andZ.Sui. A survey on in-context learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), pages 1107–1128. Association for Computational Linguistics, 2024

work page 2024
[17]

S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao. Model tells you what to discard: Adaptive kv cache compression for llms. InInternational Conference on Learning Representations, 2024

work page 2024
[18]

Gionis, P

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), pages 518– 529, Edinburgh, Scotland, UK, 1999. Morgan Kaufmann

work page 1999
[19]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Fan, A. Goyal, A. Hartshorn, A. Yang, and ... The llama 3 herd of 14 models.arXiv preprint arXiv:2407.21783, 2024. doi: 10.48550/arXiv.2407.21783

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
[20]

Gupta, G

A. Gupta, G. Dar, S. Goodman, D. Ciprut, and J. Berant. Memory-efficient transformers via top-k attention. InProceedings of the Second Workshop on Simple and Efficient Natural Language Processing, 2021

work page 2021
[21]

Hoffmann, S

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre. Training compute-optimal large language models. InAdvances in Neural Informat...

work page 2022
[22]

Keutzer, and A

C.R.C.Hooper, S.Kim, H.Mohammadzadeh, M.Maheswaran, S.Zhao, J.Paik, M.W.Mahoney, K. Keutzer, and A. Gholami. Squeezed attention: Accelerating long context length llm inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 32631–32652, 2025

work page 2025
[23]

Hsieh, S

C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling (COLM), 2024

work page 2024
[24]

Indyk and R

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. InProceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC, pages 604–613. ACM, 1998

work page 1998
[25]

RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

S. Joshi, A. Chowdhury, A. Kanakamedala, E. Singh, E. Tu, and A. Shrivastava. Replacing softmax similarity with a sharpened angular similarity: Theory and practice of scaling to billion- context attention.arXiv preprint arXiv:2510.04008, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Jumper, R

J. Jumper, R. Evans, A. Pritzel, et al. Highly accurate protein structure prediction with Al- phaFold.Nature, 596:583–589, 2021

work page 2021
[27]

Kitaev, L

N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. InInternational Conference on Learning Representations, 2020

work page 2020
[28]

J. Li, D. Li, C. Xiong, and S. C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.International Conference on Machine Learning, 2023

work page 2023
[29]

StarCoder: may the source be with you!

R. Li et al. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

D. Liu, M. Chen, B. Lu, H. Jiang, Z. Han, Q. Zhang, Q. Chen, C. Zhang, B. Ding, K. Zhang, C. Chen, F. Yang, Y. Yang, and L. Qiu. RetrievalAttention: Accelerating long-context llm inference via vector retrieval. InInternational Conference on Learning Representations, 2025

work page 2025
[31]

H. Liu, M. Zaharia, and P. Abbeel. RingAttention with blockwise transformers for near-infinite context. InInternational Conference on Learning Representations, 2024

work page 2024
[32]

W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. InInternational Conference on Machine Learning, 2011

work page 2011
[33]

Llama 3.2: Model cards and prompt formats, 2024

Meta. Llama 3.2: Model cards and prompt formats, 2024. Accessed: 2026-01-24

work page 2024
[34]

GPT-4 Technical Report

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

I. Pinelis. An approach to inequalities for the distributions of infinite-dimensional martingales. InProbability in Banach Spaces, 8: Proceedings of the Eighth International Conference, pages 128–134. Springer, 1992. 15

work page 1992
[36]

Press, N

O. Press, N. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. InInternational Conference on Learning Representations, 2022

work page 2022
[37]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021
[38]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

work page 2020
[39]

Ratner, Y

N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, and Y. Shoham. Parallel context windows for large language models. In Proceedings of the 61st annual meeting of the association for computational linguistics, pages 6383–6402, 2023

work page 2023
[40]

Ruan and Q

L. Ruan and Q. Jin. Survey: Transformer based video-language pre-training.AI Open, 3:1–13, 2022

work page 2022
[41]

Singhania, S

P. Singhania, S. Singh, S. He, S. Feizi, and A. Bhatele. Loki: Low-rank keys for efficient sparse attention. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[42]

Z. Sun, Y. Yang, and S. Yoo. Sparse attention with learning-to-hash. InInternational Conference on Learning Representations, 2022

work page 2022
[43]

J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. InInternational Conference on Machine Learning, 2024

work page 2024
[44]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

work page 2017
[46]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain- of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[47]

C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun. Infllm: Training- free long-context extrapolation for llms with an efficient context memory. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[48]

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations, 2024

work page 2024
[49]

Yagnik, D

J. Yagnik, D. Strelow, D. A. Ross, and R. Lin. The power of comparative reasoning. InInterna- tional Conference on Computer Vision, pages 2431–2438. IEEE, 2011

work page 2011
[50]

Yan, G.-Q

S. Yan, G.-Q. Jiang, Y. Zhang, X. Ma, R. Zhu, C. Cao, and J. Xu. Adamas: Hadamard sparse attention for efficient long-context inference.arXiv preprint arXiv:2510.18413, 2025

work page arXiv 2025
[51]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, and et al. Qwen3 technical report.arXiv preprint, arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

S. Yang, Y. Sheng, J. E. Gonzalez, I. Stoica, and L. Zheng. Post-training sparse attention with double sparsity.arXiv preprint arXiv:2408.07092, 2024. 16

work page arXiv 2024
[53]

Zhang, X

H. Zhang, X. Ji, Y. Chen, F. Fu, X. Miao, X. Nie, W. Chen, and B. Cui. PQCache: Product quantization-based kvcache for long context llm inference. InProceedings of the ACM SIGMOD International Conference on Management of Data, 2025

work page 2025
[54]

Zhang, Y

Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. W. Barrett, Z. Wang, and B. Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023

work page 2023
[55]

X. Zhou, Z. Sun, and G. Li. DB-GPT: Large language model meets database.Data Science and Engineering, 9(1):102–111, 2024

work page 2024
[56]

soft attention

K. Zhu, T. Tang, Q. Xu, Y. Gu, Z. Zeng, R. Kadekodi, L. Zhao, A. Li, A. Krishnamurthy, and B. Kasikci. Tactic: Adaptive sparse attention with clustering and distribution fitting for long-context LLMs.arXiv preprint arXiv:2502.12216, 2025. 17 A Appendix A.1 Additional Notes on Experiments Table 4:Comparison of dense and sparse attention methods on LongBenc...

work page arXiv 2025
[57]

Also, for eachi, ˆw⊤ i k∼N(0,1)is symmetric about0, soE[sign( ˆw⊤ i k)] = 0and therefore E[Y] = ∑ isi E[sign( ˆw⊤ i k)] = 0

=N(0,1), henceE[X] = 0and E[X 2] = 1. Also, for eachi, ˆw⊤ i k∼N(0,1)is symmetric about0, soE[sign( ˆw⊤ i k)] = 0and therefore E[Y] = ∑ isi E[sign( ˆw⊤ i k)] = 0. ForE[Y 2], expandY 2 = ∑P i=1s2 i +∑ i̸=jsisj sign( ˆw⊤ i k) sign(ˆw⊤ j k), so E[Y 2] = ∑P i=1s2 i + ∑ i̸=jsisj E[sign( ˆw⊤ i k) sign(ˆw⊤ j k)]. Fori̸=j, the pair(ˆw⊤ i k, ˆw⊤ j k)is jointly Gau...

work page
[58]

Summing overiyieldsE[XY] = C ∑P i=1(q⊤ˆwi)si withC=E|r|= √ 2/π

Alsorsign(r) =|r|, soE[(q⊤k) sign(ˆw⊤ i k)] = (q ⊤ˆwi)E|r|. Summing overiyieldsE[XY] = C ∑P i=1(q⊤ˆwi)si withC=E|r|= √ 2/π. Finally, sinceE[X] =E[Y] = 0, the correlation isΓ = E[XY]√ E[X 2]E[Y 2] =C ∑ i(q⊤ ˆwi)si√∑ is2 i . Writing ∑ i(q⊤ˆwi)si =q⊤WTsand∥s∥2 = √∑ is2 i givesΓ =Cq ⊤WTˆs. Hard vs. soft scoring.Lemma 6 shows that, under approximately orthogon...

work page