SSV: Sparse Speculative Verification for Efficient LLM Inference

Nuo Shen; Rong Gu; Sheng Zhong; Yuhang Zhou; Zhibin Wang; Ziyu Zhong

arxiv: 2605.19893 · v2 · pith:4SBKT3TQnew · submitted 2026-05-19 · 💻 cs.OS

Zhibin Wang, Ziyu Zhong, Nuo Shen, Yuhang Zhou, Rong Gu, Sheng Zhong

Pith reviewed 2026-05-21 07:10 UTC · model grok-4.3

classification 💻 cs.OS

keywords sparse speculative verification · LLM inference · speculative decoding · dynamic sparse attention · kernel fusion · grouped-query execution · throughput optimization · long-context inference

The pith

SSV resolves the mismatch between speculative decoding and sparse attention to reach up to 3.49x LLM inference throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines speculative decoding, which amortizes target-model work across several verifier queries at once, with dynamic sparse attention, which shrinks the key-value working set of each query. A straightforward combination runs into trouble because verification benefits from structure shared across queries, while sparse attention gives each query its own sparse layout. SSV resolves the clash by recasting sparse attention as a verification-oriented workload through overlap-aware grouped-query execution, refresh/reuse kernel fusion, and profile-guided adaptive orchestration. A reader would care because the result is much higher end-to-end speed on long contexts while still meeting user-chosen precision targets.
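To make the mismatch concrete, the sketch below measures how many selected KV blocks two verifier queries share when each one picks its own top-k blocks. The scores, the value of k, and the function names are illustrative assumptions, not data or code from the paper.

```python
from typing import List, Set


def select_top_k_blocks(scores: List[float], k: int) -> Set[int]:
    """Return the indices of the k highest-scoring KV blocks for one query."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def overlap_ratio(a: Set[int], b: Set[int]) -> float:
    """Fraction of query a's selected blocks that query b also selected."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical per-block importance scores for two adjacent verifier queries.
scores_q1 = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.05, 0.6]
scores_q2 = [0.85, 0.15, 0.2, 0.75, 0.7, 0.1, 0.65, 0.3]

blocks_q1 = select_top_k_blocks(scores_q1, k=4)
blocks_q2 = select_top_k_blocks(scores_q2, k=4)
ratio = overlap_ratio(blocks_q1, blocks_q2)
print(sorted(blocks_q1), sorted(blocks_q2), ratio)
```

In this toy case only half of the first query's blocks are shared, so a kernel that treats the two queries as identical would either load extra blocks or change what each query attends to.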

Core claim

SSV turns dynamic sparse attention into a verification-oriented workload by combining overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes.

What carries the argument

Overlap-aware grouped-query execution paired with refresh/reuse-based NSA kernel fusion, which reuses KV blocks across verifier queries even when each query has its own sparse layout.
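A simplified host-side model of the grouped-query idea: take the union of the blocks selected by a group of queries, load each block once, and record which queries consume it so that per-query selection is unchanged. This is only a counting sketch under assumed inputs; the paper's kernels run on GPU tiles and are not reproduced here, and the names `merged_schedule` and `block_loads` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def merged_schedule(selected: Dict[str, Set[int]]) -> Dict[int, List[str]]:
    """Map each KV block to the queries that selected it, ordered by block id."""
    schedule: Dict[int, List[str]] = {}
    for query, blocks in selected.items():
        for block in blocks:
            schedule.setdefault(block, []).append(query)
    return dict(sorted(schedule.items()))


def block_loads(selected: Dict[str, Set[int]]) -> Tuple[int, int]:
    """Return (loads when every query fetches its own blocks, loads when shared)."""
    per_query = sum(len(blocks) for blocks in selected.values())
    shared = len(set().union(*selected.values()))
    return per_query, shared


# Hypothetical selected-block sets for three verifier queries in one group.
selected = {"q1": {0, 2, 4, 7}, "q2": {0, 3, 4, 6}, "q3": {0, 3, 4, 5}}
schedule = merged_schedule(selected)
per_query_loads, shared_loads = block_loads(selected)
print(schedule)
print(per_query_loads, shared_loads)
```

Each query still sees exactly the blocks it selected; only the number of fetches changes, which is the traffic the overlap-aware design targets.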

If this is right

End-to-end throughput rises by as much as 3.49 times compared with autoregressive NSA decoding.
Kernel speedups reach up to 6.86 times for the sparse speculative verification workload.
Verification strategy choice becomes input- and regime-aware while staying inside given precision classes.
Cross-query KV-block reuse improves while branch-wise and index-selection costs drop.
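The input- and regime-aware strategy choice can be pictured as a small selection step over profiled candidates. The sketch below assumes the precision class reduces to a single flag that either permits or forbids approximate grouped-query modes; that reduction, the `Profile` type, and all throughput numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Profile:
    name: str            # hypothetical strategy label
    mode: str            # "exact" or "approximate" kernel mode
    tokens_per_s: float  # throughput observed while profiling this prompt


def pick_strategy(profiles: List[Profile], allow_approximate: bool) -> Optional[Profile]:
    """Choose the fastest profiled strategy that the precision class permits."""
    allowed = [p for p in profiles if allow_approximate or p.mode == "exact"]
    return max(allowed, key=lambda p: p.tokens_per_s, default=None)


# Made-up profiling results for one prompt.
profiles = [
    Profile("no-grouping", "exact", 100.0),
    Profile("exact-merged", "exact", 140.0),
    Profile("approx-coarse", "approximate", 170.0),
]
strict_choice = pick_strategy(profiles, allow_approximate=False)
relaxed_choice = pick_strategy(profiles, allow_approximate=True)
print(strict_choice.name, relaxed_choice.name)
```

The point of the sketch is the ordering of concerns: the precision class filters first, and throughput only ranks what remains.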

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mismatch-resolution pattern could apply when pairing other acceleration methods that also separate shared versus per-query work.
Profile-guided orchestration might transfer to other adaptive LLM pipelines where prompt statistics vary.
Extending the reuse techniques to multi-GPU settings would test whether the reported speedups scale beyond single-device runs.
Applying the approach to even longer contexts could show how reuse benefits grow with sequence length.

Load-bearing premise

The overlap-aware execution, kernel fusion, and profile-guided orchestration can close the gap between shared query patterns and per-query sparse layouts without adding overhead that wipes out the speed gains on real inputs.

What would settle it

Measure end-to-end throughput and kernel times on prompts with widely differing sparsity patterns; if gains fall below 1x or if fusion overheads exceed reported savings, the central claim does not hold.
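The check described above could be scripted roughly as follows: compute a per-prompt throughput ratio against the baseline, then report the spread and count the prompts where the ratio drops below 1x. The helper names and the measurements are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List


def speedups(baseline_tps: List[float], candidate_tps: List[float]) -> List[float]:
    """Per-prompt throughput ratio: candidate tokens/s over baseline tokens/s."""
    if len(baseline_tps) != len(candidate_tps):
        raise ValueError("need one baseline and one candidate measurement per prompt")
    return [c / b for b, c in zip(baseline_tps, candidate_tps)]


def summarize(ratios: List[float]) -> Dict[str, float]:
    """Mean, spread, worst case, and number of prompts slower than the baseline."""
    return {
        "mean": mean(ratios),
        "stdev": stdev(ratios) if len(ratios) > 1 else 0.0,
        "min": min(ratios),
        "regressions": float(sum(r < 1.0 for r in ratios)),
    }


# Placeholder tokens/s for four prompts with different sparsity patterns.
baseline = [50.0, 50.0, 40.0, 50.0]
candidate = [100.0, 75.0, 30.0, 150.0]
ratios = speedups(baseline, candidate)
summary = summarize(ratios)
print(ratios, summary)
```

Reporting the minimum and the regression count alongside the mean matters here, because an "up to" figure says nothing about the prompts where the method loses.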

Figures

Figures reproduced from arXiv: 2605.19893 by Nuo Shen, Rong Gu, Sheng Zhong, Yuhang Zhou, Zhibin Wang, Ziyu Zhong.

**Figure 1.** Background and mismatch when combining speculative decoding with sparse attention.

**Figure 2.** Selected-block overlap ratio between adjacent verifier queries (8K context).

**Figure 3.** Overview of SSV.

**Figure 4.** Selected-block overlap ratio versus absolute token-position distance Δ (8K context). The cross-query overlap consistently peaks at small Δ and decays as queries become further apart in the sequence.

**Figure 5.** Overlap-aware kernel design.

**Figure 6.** Demonstration of three fusion strategies.

**Figure 7.** Forward latency versus draft length 𝛾 on a Llama3-1B backbone at 32K context; SSV uses the same backbone with its attention layers replaced by NSA-based sparse verification.

**Figure 8.** End-to-end generation throughput under EAGLE-3 with different draft-tree shapes. Each cell shows throughput on the first line and speedup on the second line, where the speedup baseline is the NSA decode throughput (49 token/s).

**Figure 10.** Performance breakdown and ablation of SSV kernel variants. The charts compare grouped-query kernel variants, cross-query overlap 𝑠 (the number of shared selected blocks between adjacent queries), and reuse-layer execution under small and large draft lengths (𝛾 = 4 and 𝛾 = 64). Refresh/reuse indicates the layer type, and no grouping/exact/approximate indicates the grouped-query kernel mode.

Original abstract

Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SSV, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes. Experiments on NVIDIA H100 GPUs show that SSV achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.
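The amortization that the abstract attributes to speculative decoding can be quantified with the standard expected-tokens formula from the speculative decoding literature. The sketch assumes a linear draft chain of length γ and an independent per-token acceptance rate α; the symbol α and the independence assumption are not from this paper, and tree-shaped drafts such as those used with EAGLE-3 are not modeled.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Under i.i.d. acceptance with rate alpha and draft length gamma this is
    (1 - alpha**(gamma + 1)) / (1 - alpha), which tends to gamma + 1 as
    alpha approaches 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Illustrative acceptance rates, not measurements.
for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha, gamma=4), 3))
```

The formula only counts tokens per verification pass; whether that turns into throughput depends on how much each pass costs, which is where the sparse-verification kernels come in.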

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SSV gives a workable way to combine speculative decoding with dynamic sparse attention by fixing their reuse mismatch, but the reported speedups rest on thin experimental description.

The main point is that SSV tackles the mismatch where speculative verification wants shared work across queries while dynamic sparse attention gives each query its own sparse pattern. The authors fix this with overlap-aware grouped-query execution, refresh/reuse NSA kernel fusion, and profile-guided orchestration that picks strategies based on the prompt and precision target. That combination is the actual new piece; it is not a brand-new algorithm but a practical assembly of existing pieces aimed at long-context serving on GPUs like the H100. The reported 3.49x end-to-end throughput and 6.86x kernel gains are the numbers that matter to practitioners who care about hardware cost for longer contexts. If those hold under normal conditions, the work is useful.

The soft spot is the lack of detail on baselines, variance, exact token counts, or how they avoided cherry-picking inputs. Without those, it is difficult to know whether the gains survive different models, prompt lengths, or sparsity patterns. The abstract alone does not let a reader reproduce or even fully judge the central claim.

This paper is for systems people who already run speculative decoding or NSA and want to glue them together without losing the benefits. A reading group focused on inference stacks would get value from the concrete engineering choices. I would send it to peer review because the framing is clear, the problem is real, and the proposed mitigations are specific enough that referees can check the measurements and see whether the overheads stay low across regimes.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes SSV, a sparse speculative-verification framework that integrates dynamic sparse attention with speculative decoding for long-context LLM inference. It identifies a structural mismatch between cross-query commonality in verification and query-specific sparse layouts, and addresses it via three techniques: overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration. Experiments on NVIDIA H100 GPUs report up to 3.49× end-to-end throughput gains over autoregressive NSA decoding and up to 6.86× kernel speedups under user-specified precision classes.

Significance. If the empirical results hold under rigorous validation, SSV would represent a meaningful systems contribution by enabling effective combination of two complementary LLM acceleration methods, with potential impact on efficient inference serving for long-context models. The work supplies concrete H100 measurements that could inform practical deployment decisions.

major comments (1)

[Experiments] Experiments section: The abstract and reported results state concrete speedup numbers (3.49× end-to-end, 6.86× kernel) from H100 experiments, but provide no details on baselines, variance across runs, data exclusion rules, or exact measurement methodology (e.g., timing scope, batch sizes, or precision handling); this renders the central performance claims difficult to evaluate or reproduce from the manuscript alone.

minor comments (1)

[Introduction] Introduction: The description of the structural mismatch could benefit from a small illustrative diagram or concrete example of how cross-query reuse is limited under query-specific sparse layouts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We agree that additional experimental details are necessary to support reproducibility of the reported performance numbers and will revise the paper accordingly.

Referee: [Experiments] Experiments section: The abstract and reported results state concrete speedup numbers (3.49× end-to-end, 6.86× kernel) from H100 experiments, but provide no details on baselines, variance across runs, data exclusion rules, or exact measurement methodology (e.g., timing scope, batch sizes, or precision handling); this renders the central performance claims difficult to evaluate or reproduce from the manuscript alone.

Authors: We acknowledge that the current manuscript version does not provide sufficient methodological details to fully evaluate or reproduce the reported speedups. In the revised manuscript we will expand the Experiments section with: (1) explicit descriptions of all baselines and their configurations (including autoregressive NSA, standard speculative decoding, and any other comparators); (2) variance statistics across repeated runs together with standard deviations; (3) any data exclusion or outlier-handling rules; and (4) a precise measurement protocol covering timing scope (kernel-only vs. end-to-end including overheads), batch sizes, sequence lengths, and precision settings (e.g., FP16/BF16). We will also add a dedicated “Experimental Methodology” subsection that consolidates these elements.

revision: yes
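A measurement protocol of the kind promised in this response might look like the sketch below: discard warmup runs, repeat the timed region, and report the mean with its standard deviation. It is a CPU-only illustration using the Python standard library; the function name and defaults are assumptions. Timing GPU kernels would additionally need device synchronization or CUDA events (the reference list cites the PyTorch `torch.cuda.Event` documentation), which this sketch does not attempt.

```python
import time
from statistics import mean, stdev
from typing import Callable, Dict


def time_workload(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> Dict[str, float]:
    """Time fn after discarding warmup runs; return mean and stdev in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs are executed but not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return {
        "mean_ms": mean(samples),
        "stdev_ms": stdev(samples) if len(samples) > 1 else 0.0,
        "runs": float(len(samples)),
    }


# Stand-in workload; a real study would call the decode or kernel path here.
result = time_workload(lambda: sum(range(10_000)), warmup=1, repeats=5)
print(result)
```

Stating the timing scope explicitly, as point (4) of the response proposes, comes down to deciding what goes inside `fn`: the kernel alone, or the whole draft-and-verify loop.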

Circularity Check

0 steps flagged

No significant circularity

The paper describes an engineering system (SSV) that combines speculative decoding with dynamic sparse attention via three concrete optimizations: overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided orchestration. All load-bearing claims are empirical throughput and kernel-speedup measurements on H100 hardware under stated precision classes. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the abstract or framing; the central argument is that the listed mitigations overcome the stated structural mismatch, and this is justified by reported experimental outcomes rather than by any quantity defined in terms of itself. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work appears to rely on standard assumptions about GPU kernel behavior and prompt statistics that are not detailed here.

pith-pipeline@v0.9.0 · 5729 in / 1168 out tokens · 38002 ms · 2026-05-21T07:10:46.805358+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse..."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: "Experiments on NVIDIA H100 GPUs show that SSV achieves up to 3.49x end-to-end throughput..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 10 internal anchors

[1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 4895–4901.

[2] ajibawa 2023. 2023. Children Stories Collection. doi:10.57967/hf/2480

[3] Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, and Juanzi Li. 2026. IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse. arXiv preprint arXiv:2603.12201 (2026).

[4] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).

[5] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 7432–7439.

[6] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024).

[7] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv preprint arXiv:2302.01318 (2023).

[8] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018).

[9] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2022. Introduction to algorithms. MIT press.

[10] Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023).

[11] Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, and Ramachandran Ramjee. 2025. Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference. arXiv preprint arXiv:2512.16391 (2025).

[12] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. The L... doi:10.5281/zenodo.12608602

[13] Yizhao Gao, Jianyu Wei, Qihao Zhang, Yu Cheng, Shimao Chen, Zhengju Tang, Zihan Jiang, Yifan Song, Hailin Zhang, Liang Zhao, et al. 2026. HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing. arXiv preprint arXiv:2602.03560 (2026).

[14] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ankush Kadian, Amal Al-Dahle, Aiesha Letman, Anukriti Mathur, Ashwin Schelten, Angela Yang, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024).

[15] Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, and Jonathan Berant. 2021. Memory-efficient Transformers via Top-𝑘 Attention. arXiv preprint arXiv:2106.06899 (2021).

[16] Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. 2019. A study of BFLOAT16 for deep learning training. arXiv preprint arXiv:1905.12322 (2019).

[17] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles. 611–626.

[18] Xunhao Lai. 2025. native-sparse-attention-triton. https://github.com/XunhaoLai/native-sparse-attention-triton

[19] Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. 2025. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. arXiv preprint arXiv:2502.20766 (2025).

[20] Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. In Proceedings of the 40th International Conference on Machine Learning. 19274–19286.

[21] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. Eagle: Speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077 (2024).

[22] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. Eagle-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840 (2025).

[23] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025).

[24] Meta. 2024. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2026-05-13.

[25] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. 2024. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification. In Proc...

[26] Sanjit Neelam, Vaclav Cvicek, Daniel Heinlein, Akshay Mishra, Mahdi Nazemi, and Gilbert Hendry. 2025. Speculative Decoding with Blockwise Sparse Attention. MatX Research. https://matx.com/research/sd_nsa. Accessed: 2026-04-26.

[27] NVIDIA Corporation. 2024. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/. Accessed: 2026-04-26.

[28] PyTorch Contributors. 2025. torch.cuda.Event. https://docs.pytorch.org/docs/2.11/generated/torch.cuda.Event.html. PyTorch 2.11 documentation. Accessed: 2026-05-09.

[29] Philippe Tillet, Hsiang-Tsung Kung, and David Cox. 2019. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).

[31] W Hwu Wen-Mei, David B Kirk, and Izzat El Hajj. 2026. Programming massively parallel processors: a hands-on approach. Morgan Kaufmann.

[32] Ran Yan, Youhe Jiang, and Binhang Yuan. 2025. Flash sparse attention: An alternative efficient implementation of native sparse attention kernel. arXiv e-prints (2025), arXiv–2508.

[33] Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. 2025. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 23078–23097.

[34] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in neural information processing systems 33 (2020), 17283–17297.

[35] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence?. In Proceedings of the 57th annual meeting of the association for computational linguistics. 4791–4800.

[36] zen-E. 2025. NSA-1B. https://huggingface.co/zen-E/NSA-1B. Accessed: 2026-05-05.

[37] zhenyi4. 2025. SSA. https://github.com/zhenyi4/ssa. Accessed: 2026-05-05.
