pith. machine review for the scientific record. sign in

arxiv: 2605.07719 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI· cs.PF

Recognition: no theorem link

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.PF
keywords long-context inferencesparse attentionhybrid CPU-GPU executionKV cacheattention optimizationsystem efficiency
0
0 comments X

The pith

Fluxion accelerates long-context inference 1.5x-3.7x over fixed sparse baselines by dynamically budgeting CPU-resident KV caches and overlapping CPU-GPU sparse attention execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fluxion to handle cases where long-context KV caches must stay in CPU memory because they exceed GPU capacity or because prefill and decode are disaggregated. It builds a hybrid sparse attention design around three elements: output-aware KV budgeting that decides how many tokens each head keeps, head-specific and granularity-aware sparse patterns that choose which blocks to attend to, and a priority scheduler that overlaps CPU top-k selection and sparse computation with GPU work. A lightweight predictor and budget selector drive these decisions at low cost. The result is end-to-end efficiency while keeping quality close to dense attention across models and tasks. A reader should care because this removes the practical barrier of moving huge KV states back and forth across PCIe or leaving the GPU idle during CPU-side work.

Core claim

Fluxion jointly optimizes KV budget allocation, head-specific granularity-aware sparse configuration, and cross-device execution overlap through a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler, enabling hybrid sparse attention over CPU-resident KV caches to deliver 1.5×-3.7× speedup over the strongest fixed-sparse hybrid baseline while limiting average quality degradation to -0.26 relative to full attention.

What carries the argument

The central mechanism is output-aware KV budgeting combined with head-specific granularity-aware sparse configuration, coordinated by a priority-based scheduler that overlaps CPU-side top-k selection and sparse computation with GPU execution.

Load-bearing premise

The lightweight head-property predictor and granularity-budget selector can accurately guide sparse configuration and scheduling without adding meaningful overhead or quality loss across diverse models and tasks.

What would settle it

Running the same models and tasks but replacing the learned predictor with random budget and granularity choices, then measuring whether quality drops below -1.0 relative to full attention or speedup falls below 1.2×.

Figures

Figures reproduced from arXiv: 2605.07719 by Feiyu Yao, Juan Fang, Qian Wang, Xiaqing Li, Yongqiang Xiong, Zhixiong Niu.

Figure 2
Figure 2. Figure 2: Time breakdown of two sparse-attention placements during decoding. (a) Varying sequence length at BSZ=8, budget=5%. (b) Varying batch size at SeqLen=32K, budget=5%. (c) GPU idle ratio and accuracy under different budgets at BSZ=8, SeqLen=32K. (d) Fraction of CPU-side Top-K selection in Top-K + Attention under different block sizes and budgets. GPU-only sparse attention. This class of methods per￾forms bloc… view at source ↗
Figure 3
Figure 3. Figure 3: (a) shows that high attention-score coverage is a poor proxy for small attention-output deviation: even when 0% 100% 200% 300% (a) Relative Error 0.00 0.25 0.50 0.75 1.00 CDF 20% 0.9 0.95 0 4 8 12 16 20 24 28 (b) Layer 0.00 2.00 4.00 6.00 Mean Value Norm Sink Token Other Token 0 4 8 12 16 20 24 28 (c) Layer 0.00 0.25 0.50 0.75 1.00 Attention-score Fraction [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per-head output deviation alone is insufficient for budget allocation. Llama-3.1-8B-Instruct has 32 heads, each with 128 output dimensions. 95% of the total attention score is preserved, 25% of attention heads still incur a relative output error above 20%. A closer analysis shows that this mismatch is mainly driven by the ubiquitous attention sink phenomenon [42, 45] [PITH_FULL_IMAGE:figures/full_fig_p004… view at source ↗
Figure 5
Figure 5. Figure 5: Per-head minimum budget vs. block size for 32 heads in a layer. Each line represents one head. is closer to the final approximation target than attention￾score coverage, it implicitly assumes that errors from dif￾ferent heads are equally important. In practice, however, different heads capture different patterns and make unequal contributions to the final representation after the O projec￾tion, leading to … view at source ↗
Figure 6
Figure 6. Figure 6: The overview of Fluxion. 4 Design Overview To improve the efficiency of block-sparse attention on het￾erogeneous CPU-GPU architectures, we design Fluxion, an efficient hybrid sparse-attention mechanism for long-context LLM inference. Fluxion consists of three key components. (i) Head property predictor: it identifies each attention head as either a streaming head or a retrieval head at low overhead and pre… view at source ↗
Figure 7
Figure 7. Figure 7: TPOT comparison under different batch sizes and sequence lengths on RULER. speedups are conservative. Even so, Fluxion achieves 2.5×- 3.7× speedup on Llama and 1.9×-3.4× on Qwen. The gain further increases with either batch size or context length. For example, at batch size 4, the speedup on Qwen rises from 1.9× at 32K to 3.4× at 128K. Although w/o Fluxion(32, 0.02) is typically the fastest fixed configura… view at source ↗
Figure 8
Figure 8. Figure 8: TPOT under mixed-length workloads on RULER. w/o Fluxion(blk, bgt) Qwen2.5-7B-Instruct Llama-3.1-8B-Instruct (BSZ, SeqLen) (16, 0.05) Fluxion (16, 0.05) Fluxion (16, 32K) 88.79% / 8.72 69.64% / 2.16 87.42% / 13.23 74.36% / 3.00 (8, 32K) 81.46% / 4.54 56.27% / 1.23 86.45% / 6.91 61.79% / 1.69 (4, 32K) 70.65% / 2.33 45.78% / 0.78 85.83% / 4.06 52.16% / 1.00 (4, 64K) 81.30% / 4.43 53.68% / 1.11 86.50% / 6.43 5… view at source ↗
Figure 10
Figure 10. Figure 10: (a) Predictor overhead. (b) Effect of 𝜏. unnecessary budget waste. In contrast, further incorporating head relative contribution enables more effective KV bud￾get allocation and significantly reduces latency with little accuracy loss. Dynamic granularity-budget allocation and stream￾ing head skipping [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Normality of CPU-offloaded 𝑞–𝑘 interac￾tions. We plot 𝑧𝑖(𝑞) = ⟨𝑞, 𝑘𝑖⟩/(∥𝑞∥2 √ 𝐷) for CPU-offloaded keys. (b): histogram with a fitted Gaussian density (layer 0). (a): Q–Q plot against N (𝜇, 𝜎2 ) (layer 20). Results are from Llama-3.1-8B-Instruct on RULER with 32K context, using two representative layers. A.1 Feature Definition We define features at a Transformer layer ℓ and attention head ℎ ∈ {1, . . . , … view at source ↗
read the original abstract

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well -- the worst average degradation is only -0.26 relative to FULL, while delivering 1.5$\times$-3.7$\times$ speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Fluxion, a hybrid sparse attention system for long-context LLM inference with CPU-resident KV caches. It combines output-aware KV budgeting, head-specific and granularity-aware sparse configurations, and cross-device coordinated execution via a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler. Empirical evaluation across 2 models, 3 benchmarks, and 40 tasks reports a worst-case average quality degradation of -0.26 relative to full attention and 1.5×–3.7× end-to-end speedup over the strongest fixed-sparse hybrid baseline (KV budget 0.05).

Significance. If the results hold under detailed verification, the work is significant for practical long-context inference on hybrid CPU-GPU platforms, where memory capacity and PCIe bandwidth are bottlenecks. It advances beyond isolated sparse attention by co-designing algorithmic choices (budgeting and per-head sparsity) with system scheduling for overlap. The breadth of the evaluation (multiple models and tasks) is a strength; the focus on end-to-end metrics rather than micro-benchmarks is also positive.

major comments (3)
  1. [§3.2] §3.2 (Head-Property Predictor): The central quality claim (worst avg. degradation -0.26 vs. FULL) depends on the predictor accurately selecting head-specific sparse configurations. The manuscript describes it as lightweight but provides no equations for its input features, training loss, or per-head accuracy metrics; without these, it is impossible to assess whether mispredictions on even a subset of heads would violate the reported bound.
  2. [§4.2] §4.2 (Granularity-Budget Selector and Scheduler): The speedup range (1.5×–3.7×) rests on the selector and priority scheduler successfully hiding CPU top-k and sparse-attn latency via CPU-GPU overlap. The text gives no ablation isolating selector overhead or measuring prediction accuracy across the 40 tasks; if selector errors force conservative budgets or poor overlap, the speedup over the fixed 0.05-budget baseline would shrink substantially.
  3. [§4.1] §4.1 (Experimental Setup): The reported numbers lack any mention of run count, standard deviation, or statistical tests. Because the speedup and degradation figures are load-bearing for the main claims, the absence of variance information makes it difficult to judge whether the results are robust across hardware or task variations.
minor comments (2)
  1. [Figure 3] Figure 3 caption: the legend does not explicitly state what the shaded regions represent (e.g., min-max or std. dev. across tasks).
  2. [§2.2] §2.2: the notation for KV budget (B) is introduced without a clear definition of its units or normalization relative to sequence length.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive evaluation of the significance of our work. We address each of the major comments in detail below and will revise the manuscript accordingly to improve clarity and completeness.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Head-Property Predictor): The central quality claim (worst avg. degradation -0.26 vs. FULL) depends on the predictor accurately selecting head-specific sparse configurations. The manuscript describes it as lightweight but provides no equations for its input features, training loss, or per-head accuracy metrics; without these, it is impossible to assess whether mispredictions on even a subset of heads would violate the reported bound.

    Authors: We agree that additional details on the head-property predictor are necessary to fully substantiate the quality claims. In the revised manuscript, we will include the equations for the input features, the training loss, and per-head accuracy metrics. These additions will allow assessment of the predictor's reliability. revision: yes

  2. Referee: [§4.2] §4.2 (Granularity-Budget Selector and Scheduler): The speedup range (1.5×–3.7×) rests on the selector and priority scheduler successfully hiding CPU top-k and sparse-attn latency via CPU-GPU overlap. The text gives no ablation isolating selector overhead or measuring prediction accuracy across the 40 tasks; if selector errors force conservative budgets or poor overlap, the speedup over the fixed 0.05-budget baseline would shrink substantially.

    Authors: We acknowledge the need for ablations on the granularity-budget selector and scheduler. The revised manuscript will include new experiments isolating the selector's overhead and reporting its prediction accuracy across all 40 tasks. This will confirm that the overhead is minimal and that the overlap is effective, supporting the reported speedups. revision: yes

  3. Referee: [§4.1] §4.1 (Experimental Setup): The reported numbers lack any mention of run count, standard deviation, or statistical tests. Because the speedup and degradation figures are load-bearing for the main claims, the absence of variance information makes it difficult to judge whether the results are robust across hardware or task variations.

    Authors: We agree that statistical robustness information is important for the main claims. In the revision, we will report the number of experimental runs, standard deviations for the speedup and quality degradation figures, and results of statistical significance tests to demonstrate that the results are robust. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with independent benchmarks

full rationale

The paper describes a hybrid sparse attention system (Fluxion) built from three insights and implemented via a head-property predictor, granularity-budget selector, and scheduler. All load-bearing claims are empirical: measured quality degradation (-0.26 worst-case vs FULL) and speedups (1.5-3.7x) across 2 models, 3 benchmarks, and 40 tasks, compared to fixed-sparse baselines. No equations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or ansatzes appear in the provided text. The derivation chain is absent; results are direct measurements against external baselines, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering systems paper focused on implementation and empirical evaluation rather than theoretical derivations; no free parameters, axioms, or invented entities are apparent from the abstract.

pith-pipeline@v0.9.0 · 5553 in / 1191 out tokens · 118503 ms · 2026-05-11T02:52:25.294686+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 7 internal anchors

  1. [1]

    Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2024. L-Eval: Institut- ing Standardized Evaluation for Long Context Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers). 14388–14411

  2. [2]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ...

  3. [3]

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks.arXiv preprint arXiv:2412.15204(2024)

  4. [4]

    Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150 (2020)

  5. [5]

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Xiao Wen

  6. [6]

    Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069(2024)

  7. [7]

    Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Moe- lightning: High-throughput moe inference on memory-constrained gpus. InProceedings of the 30th ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, Volume 1. 715–730

  8. [8]

    Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, et al. 2025. Ktransformers: Unleashing the full potential of cpu/gpu hybrid inference for moe models. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 1014–1029

  9. [9]

    Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. 2024. Arkvale: Efficient generative llm inference with recallable key-value eviction.Advances in Neural Information Processing Systems37 (2024), 113134–113155

  10. [10]

    Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, et al. 2024. Magicpig: Lsh sampling for efficient llm gener- ation.arXiv preprint arXiv:2410.16179(2024)

  11. [11]

    Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming.IEEE computational science and engineering5, 1 (1998), 46–55

  12. [12]

    Weishu Deng, Yujie Yang, Peiran Du, Lingfeng Xiang, Zhen Lin, Chen Zhong, Song Jiang, Hui Lu, and Jia Rao. 2025. HGCA: Hybrid GPU- CPU Attention for Long Context LLM Inference.arXiv preprint arXiv:2507.03153(2025)

  13. [13]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. {Cost- Efficient} large language model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX annual technical conference (USENIX ATC 24). 111–126

  14. [14]

    Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, et al. 2025. Seerattention- r: Sparse attention adaptation for long reasoning.arXiv preprint arXiv:2506.08889(2025)

  15. [15]

    Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok- Hay So, Ting Cao, Fan Yang, and Mao Yang. 2024. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.arXiv preprint arXiv:2410.13276(2024)

  16. [16]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783(2024)

  17. [17]

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. 2023. Flashdecod- ing++: Faster large language model inference on gpus.arXiv preprint arXiv:2311.01282(2023)

  18. [18]

    Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, et al

  19. [19]

    Nosa: Native and offloadable sparse attention.arXiv preprint arXiv:2510.13602(2025)

  20. [20]

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He

  21. [21]

    Deepspeed ulysses: System optimizations for enabling train- ing of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509(2023)

  22. [22]

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xu- fang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. 2024. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems37 (2024), 52481–52515

  23. [23]

    Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2025. Neo: Saving gpu memory crisis with cpu offloading for online llm inference.Proceedings of Machine Learning and Systems7 (2025)

  24. [24]

    Daya Khudia, Jianyu Huang, Protonu Basu, Summer Deng, Haixin Liu, Jongsoo Park, and Mikhail Smelyanskiy. 2021. FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference.arXiv preprint arXiv:2101.05615(2021)

  25. [25]

    Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. 2025. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference.arXiv preprint arXiv:2502.20766(2025)

  26. [26]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

  27. [27]

    Ming Li, Han Chen, Chenguang Wang, Dang Nguyen, Dianqi Li, and Tianyi Zhou. 2025. RuleR: Improving LLM Controllability by Rule- based Data Recycling. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers). 926–943

  28. [28]

    Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, et al. [n. d.]. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  29. [29]

    Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889(2023)

  30. [30]

    Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Hao- ran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, et al

  31. [31]

    A Comprehensive Sur- vey on Long Context Language Modeling.arXiv preprint arXiv:2503.17407, 2025

    A comprehensive survey on long context language modeling. arXiv preprint arXiv:2503.17407(2025)

  32. [32]

    Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, et al

  33. [33]

    Lmcache: An efficient KV cache layer for enterprise-scale LLM inference.arXiv preprint arXiv:2510.09665(2025)

  34. [34]

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. 2025. Moba: Mixture of block attention for long-context llms.arXiv preprint arXiv:2502.13189(2025)

  35. [35]

    McCalpin

    John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers.IEEE Computer Society 13 Technical Committee on Computer Architecture (TCCA) Newsletter(Dec. 1995), 19–25

  36. [36]

    Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, and Jie Zhang. 2025. InstAttention: in-storage attention offloading for cost-effective long-context LLM inference. In2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1510–1525

  37. [37]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading more storage for less computation—a {KVCache-centric} architecture for serving {LLM} chatbot. In23rd USENIX conference on file and storage technologies (FAST 25). 155–170

  38. [38]

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. 2023. Sparq attention: Bandwidth- efficient llm inference.arXiv preprint arXiv:2312.04985(2023)

  39. [39]

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. 2024. SparQ attention: bandwidth- efficient LLM inference. InProceedings of the 41st International Confer- ence on Machine Learning. Article 1731, 26 pages

  40. [40]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053(2019)

  41. [41]

    Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, and Abhi- nav Bhatele. 2024. Loki: Low-rank keys for efficient sparse attention. Advances in Neural Information Processing Systems37 (2024), 16692– 16723

  42. [42]

    Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2024. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. arXiv preprint arXiv:2410.21465(2024)

  43. [43]

    Nazmul Takbir, Hamidreza Alikhani, Nikil Dutt, and Sangeetha Abdu Jyothi. 2025. FlexiCache: Leveraging Temporal Stability of Atten- tion Heads for Efficient KV Cache Management.arXiv preprint arXiv:2511.00868(2025)

  44. [44]

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: query-aware sparsity for efficient long- context LLM inference. InProceedings of the 41st International Confer- ence on Machine Learning. 47901–47911

  45. [45]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Syl- vain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush

  46. [46]

    InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

    Transformers: State-of-the-Art Natural Language Processing. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45

  47. [47]

    Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, and Maosong Sun. 2024. Infllm: Training-free long-context extrapolation for llms with an efficient context memory.Advances in neural information processing systems 37 (2024), 119638–119661

  48. [48]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient Streaming Language Models with Attention Sinks.arXiv(2023)

  49. [49]

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. 2025. Xattention: Block sparse attention with antidiagonal scor- ing.arXiv preprint arXiv:2503.16428(2025)

  50. [50]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al

  51. [51]

    5 Technical Report.arXiv e-prints(2024), arXiv–2412

    Qwen2. 5 Technical Report.arXiv e-prints(2024), arXiv–2412

  52. [52]

    Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han

  53. [53]

    Lserve: Efficient long-sequence llm serving with unified sparse attention.Proceedings of Machine Learning and Systems7 (2025)

  54. [54]

    Chengye Yu, Tianyu Wang, Zili Shao, Linjie Zhu, Xu Zhou, and Song Jiang. 2024. Twinpilots: A new computing paradigm for gpu-cpu parallel llm inference. InProceedings of the 17th ACM international systems and storage conference. 91–103

  55. [55]

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. 2025. Native sparse attention: Hardware-aligned and natively trainable sparse attention. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 23078–23097

  56. [56]

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences.Advances in neural information processing systems33 (2020), 17283–17297

  57. [57]

    Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. 2025. Pqcache: Product quantization-based kvcache for long context llm inference.Proceedings of the ACM on Management of Data3, 3 (2025), 1–30

  58. [58]

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. 2025. Spargeattention: Accurate and training-free sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137(2025)

  59. [59]

    Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John CS Lui, and Haibo Chen

  60. [60]

    InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles

    Diffkv: Differentiated memory management for large language models with parallel kv compaction. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 431–445

  61. [61]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured lan- guage model programs.Advances in neural information processing systems37 (2024), 62557–62583

  62. [62]

    Qihui Zhou, Peiqi Yin, Pengfei Zuo, and James Cheng. 2025. Progres- sive sparse attention: Algorithm and system co-design for efficient attention in llm serving.arXiv preprint arXiv:2503.00392(2025)

  63. [63]

    Qihui Zhou, Peiqi Yin, Pengfei Zuo, and James Cheng. 2025. Spars- eserve: Unlocking parallelism for dynamic sparse attention in long- context llm serving.arXiv preprint arXiv:2509.24626(2025)

  64. [64]

    Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, and Chao Yang

  65. [65]

    14 A Detailed Feature Definitions and Extraction The predictor in Fluxion utilizes a total of 41 low-overhead features to capture key patterns in attention computation

    Sampleattention: Near-lossless acceleration of long context llm inference with adaptive structured sparse attention.Proceedings of Machine Learning and Systems7 (2025). 14 A Detailed Feature Definitions and Extraction The predictor in Fluxion utilizes a total of 41 low-overhead features to capture key patterns in attention computation. We provide detailed...