pith. machine review for the scientific record.

arxiv: 2604.26837 · v1 · submitted 2026-04-29 · 💻 cs.LG

Recognition: unknown

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

Baotong Lu, Fan Yang, Haiying Shen, Jing Liu, Ming-Chang Yang, Qi Chen, Shengjie Lin, Yanqi Zhang, Yizou Chen, Zihan Zhao, Ziming Miao

Pith reviewed 2026-05-07 13:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords: sparse attention · KV cache management · hierarchical memory · LLM serving · long-context inference · GPU-CPU data transfer · throughput optimization

The pith

SPIN unifies different sparse attention methods under one hierarchical KV memory system to realize their promised speedups in long-context LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that sparse attention can move from algorithmic promise to practical system gains by treating the KV cache as a shared, page-based substrate rather than leaving each sparsity pattern to its own ad-hoc code. Current barriers include mismatched access granularities across algorithms and expensive irregular transfers over the GPU-CPU link that often cancel the intended savings. SPIN counters both problems with a common partition layer, a locality-aware cache manager that budgets HBM space per request and applies bucketed LRU replacement, and compact metadata sized only to the active set. When these pieces are added to vLLM and tested with three representative sparse methods, measured throughput rises 1.66-5.66 times while time-to-first-token drops 7-9 times and per-token latency falls by as much as 58 percent. A reader interested in deployable long-context models would care because the work shows how to keep the algorithmic benefit when the cache no longer fits in GPU memory alone.

Core claim

SPIN is a sparse-attention-aware inference framework built on vLLM that co-designs the execution pipeline with hierarchical KV storage. It introduces a unified partition abstraction that maps differing sparsity granularities onto a shared page-based KV substrate, a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and employs a GPU-friendly bucketed LRU policy to reduce PCIe round-trips, and a two-level hierarchical metadata layout sized to the active working set. Across three representative sparse attention algorithms the resulting system reports 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than plain vLLM, while cutting TPOT by up to 58 percent relative to the original sparse-attention implementations.
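
Read mechanically, the partition abstraction is a thin translation layer between algorithm-specific selections and a shared paged KV pool. The sketch below is a hypothetical Python rendering of that idea for illustration only; the names, page size, and selection functions are assumptions, not SPIN's actual interfaces.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 16  # tokens per KV page; illustrative value, not taken from the paper


@dataclass
class PagedKVPool:
    """Shared page-based KV substrate: tracks where each page currently lives."""
    residency: dict = field(default_factory=dict)  # page_id -> "gpu" | "cpu"

    def pages_for_tokens(self, token_ids):
        # Collapse a token-granularity selection to the pages that cover it.
        return {t // PAGE_SIZE for t in token_ids}


@dataclass
class Partition:
    """One head's critical KV subset, expressed as page IDs of the shared pool."""
    head: int
    page_ids: set


def token_level_select(pool, head, critical_tokens):
    """Token-granular sparsity (e.g. a retrieval-style selection) mapped to pages."""
    return Partition(head, pool.pages_for_tokens(critical_tokens))


def block_level_select(pool, head, critical_blocks, block_size=64):
    """Block-granular sparsity: each selected block expands to the pages it spans."""
    pages = set()
    for b in critical_blocks:
        pages |= pool.pages_for_tokens(range(b * block_size, (b + 1) * block_size))
    return Partition(head, pages)
```

Whatever the algorithm's native granularity, downstream stages would only ever see `Partition.page_ids`, which is what would let one cache manager and one retrieval path serve all three sparse methods.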

What carries the argument

The unified partition abstraction that maps varying sparsity granularities onto a shared page-based KV substrate together with the locality-aware bucketed LRU manager that sizes HBM budgets per request.

If this is right

  • Sparse attention algorithms no longer require separate system-level implementations to achieve end-to-end gains.
  • Hierarchical GPU-CPU KV storage becomes practical without the irregular transfers erasing sparsity benefits.
  • Per-request HBM budgets can be adjusted dynamically while still preserving locality across decoding steps.
  • Metadata overhead stays proportional to the active working set rather than the full context length (see the sketch after this list).
  • The same framework supports multiple representative sparse methods without per-algorithm tuning.
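
The metadata point above can be pictured as a lazily allocated page directory: second-level tables come into existence only for regions a request has actually touched, so storage tracks the active working set rather than the maximum context length. The snippet below is an illustrative reconstruction under that assumption, not SPIN's data structure.

```python
from __future__ import annotations


class TwoLevelPageTable:
    """Sparse residency table: a top-level directory keyed by region, with
    second-level leaf arrays allocated only when a region is first touched."""

    def __init__(self, pages_per_leaf: int = 1024):
        self.pages_per_leaf = pages_per_leaf
        self.directory: dict[int, list] = {}

    def set_residency(self, page_id: int, location: str) -> None:
        top, low = divmod(page_id, self.pages_per_leaf)
        leaf = self.directory.setdefault(top, [None] * self.pages_per_leaf)
        leaf[low] = location  # "gpu" or "cpu"

    def lookup(self, page_id: int) -> str | None:
        top, low = divmod(page_id, self.pages_per_leaf)
        leaf = self.directory.get(top)
        return None if leaf is None else leaf[low]

    def metadata_entries(self) -> int:
        # Grows with the number of touched regions, not with the worst-case
        # address space implied by the model's maximum context length.
        return len(self.directory) * self.pages_per_leaf
```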

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partition and bucketed-LRU ideas could be tested on attention patterns that mix local and global tokens to see whether the locality signal remains strong enough.
  • Extending the page substrate to include slower tiers such as NVMe would test how far the reduction in round-trips generalizes when latency gaps widen.
  • If the unified abstraction proves stable, it opens a route to compile-time rewriting of new sparse kernels directly onto the page layout instead of hand-coded kernels.

Load-bearing premise

That different sparse attention patterns can be expressed as partitions over the same page-based KV substrate, and that the bucketed LRU policy will reduce PCIe transfers enough to outweigh any added management cost.
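
The bucketed-LRU half of this premise admits a concrete reading: track recency at the granularity of coarse buckets (one per recent decode epoch) rather than per page, so promotions stay cheap and eviction only drains the stalest bucket. The sketch below is one plausible interpretation with assumed parameters (bucket count, eviction rule); it is not the policy as implemented in SPIN.

```python
from __future__ import annotations

from collections import OrderedDict


class BucketedLRU:
    """Approximate LRU over KV pages: recency is kept per bucket, not per page."""

    def __init__(self, capacity_pages: int, max_buckets: int = 8):
        self.capacity = capacity_pages
        self.max_buckets = max_buckets
        self.epoch = 0
        self.buckets: OrderedDict = OrderedDict({0: set()})  # epoch -> page_ids
        self.page_bucket: dict = {}                          # page_id -> epoch

    def advance_epoch(self) -> None:
        """Open a fresh most-recent bucket (e.g. once per decode step)."""
        self.epoch += 1
        self.buckets[self.epoch] = set()
        while len(self.buckets) > self.max_buckets:
            # Fold the two oldest buckets together to bound metadata size.
            _, oldest = self.buckets.popitem(last=False)
            merge_into = next(iter(self.buckets))
            self.buckets[merge_into] |= oldest
            for page in oldest:
                self.page_bucket[page] = merge_into

    def touch(self, page_id: int) -> None:
        """Record an access by moving the page into the current bucket."""
        previous = self.page_bucket.get(page_id)
        if previous is not None:
            self.buckets[previous].discard(page_id)
        self.buckets[self.epoch].add(page_id)
        self.page_bucket[page_id] = self.epoch

    def evict_one(self):
        """Drop one page from the stalest non-empty bucket when over budget."""
        if len(self.page_bucket) <= self.capacity:
            return None
        for pages in self.buckets.values():  # oldest epoch first
            if pages:
                victim = pages.pop()
                del self.page_bucket[victim]
                return victim
        return None
```

The point of the buckets, on this reading, is that the manager never maintains an exact per-page recency order, which would be hostile to batched GPU execution; it only needs to know which coarse epoch a page last belonged to before deciding what to evict back to host memory.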

What would settle it

Running the same three sparse attention algorithms on identical long-context workloads and observing that total PCIe bytes transferred or end-to-end latency do not decrease relative to the original per-algorithm implementations would falsify the central performance claim.
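
A minimal way to run that test is to instrument the host-to-device copies themselves and compare per-request byte counts and latencies across systems on identical traces. The harness below is a generic PyTorch sketch of such instrumentation, written for illustration; it is not part of the paper's evaluation code, and it counts only explicit KV page copies, not all PCIe traffic.

```python
import torch


class TransferMeter:
    """Counts bytes explicitly copied from host memory to the GPU,
    as a proxy for the PCIe traffic caused by KV retrieval."""

    def __init__(self) -> None:
        self.bytes_h2d = 0

    def fetch(self, cpu_pages: torch.Tensor) -> torch.Tensor:
        self.bytes_h2d += cpu_pages.numel() * cpu_pages.element_size()
        return cpu_pages.to("cuda", non_blocking=True)


# Route every KV page retrieval in both systems through one meter per run,
# replay the same long-context trace, then compare meter.bytes_h2d and
# end-to-end latency. If the unified system does not move fewer bytes or
# finish faster than the per-algorithm baselines, the central claim fails.
```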

Figures

Figures reproduced from arXiv: 2604.26837 by Baotong Lu, Fan Yang, Haiying Shen, Jing Liu, Ming-Chang Yang, Qi Chen, Shengjie Lin, Yanqi Zhang, Yizou Chen, Zihan Zhao, Ziming Miao.

Figure 1: Gap Between Theoretical and Realized Performance.
Figure 3: The abstracted inference pipeline for sparse attention in Spin, illustrating the decoupling of computation and data movement. Select and Retrieve are only invoked during the decode.
Figure 4: Pseudocode illustrations demonstrating the integration of ShadowKV [55] into Spin. Highlighted lines (4–9 and 15–16) denote the core algorithmic logic, directly copied from the original implementation.
Figure 5: Overview of Spin’s memory management system. Steps 1–4 describe the process of “Retrieve” for critical partitions.
Figure 6: Two-level page table design.
Figure 7: End-to-end Throughput vs. Request Rate on LongBench-v2.
Figure 8: End-to-end Throughput vs. Request Rate on LongGenBench.
Figure 9: Average Batch Size. Average number of batched requests when serving Qwen3-14B on an A100 for the LongBench-v2 and LongGenBench workloads (both at 1.5 req/s).
Figure 10: Average Latency. Average TTFT and TPOT when serving Qwen3-14B on an A100 for the LongBench-v2 workloads under varying request rates.
Figure 11: Prefill Latency vs. Context Length on an A100.
Figure 12: Offline Decode Throughput vs. Batch Size on different Context Lengths and GPUs.
Figure 13: Per-token Decode Latency Breakdown of Spin and Original Sparse Implementations. Spin consistently reduces per-token decode latency across all three sparse attention algorithms, where gains primarily come from lower PCIe retrieval overhead for ShadowKV and RetroInfer and from more efficient GPU kernels for SeerAttention-R.
Figure 16: Cache Hit Ratio vs. Cache Size. Increasing cache size significantly improves hit ratio; however, further increases beyond a certain point yield only marginal improvement.
read the original abstract

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SPIN, a sparse-attention-aware inference framework that co-designs the LLM serving pipeline with hierarchical KV storage via three techniques: (1) a unified partition abstraction mapping varying sparsity granularities to a shared page-based KV substrate, (2) a locality-aware KV cache manager using dynamic HBM budgeting and GPU-friendly bucketed LRU to reduce PCIe round-trips, and (3) a two-level hierarchical metadata layout sized to the active working set. Built atop vLLM and evaluated with three representative sparse attention algorithms, SPIN reports 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, plus up to 58% TPOT reduction versus the original sparse kernels.

Significance. If the empirical gains hold under rigorous validation, the work is significant for scalable long-context LLM serving: it directly tackles the granularity mismatch between sparse attention and hierarchical memory plus the PCIe retrieval bottleneck. The co-design of the three techniques (unified partition, bucketed LRU, and working-set metadata) is a concrete strength that could be adopted by production serving systems.

major comments (2)
  1. [§4] §4 (Experimental Evaluation): The central performance claims (1.66-5.66x throughput, 7-9x TTFT, 58% TPOT) rest on end-to-end benchmarks, yet the manuscript provides insufficient detail on experimental setup (models, context lengths, hardware configuration, sparsity patterns, number of trials, and statistical significance). This is load-bearing for the empirical result and must be expanded with tables or appendices showing raw data and controls.
  2. [§3.1] §3.1 (Unified Partition Abstraction): The claim that the page-based substrate successfully maps differing sparsity granularities without offsetting overheads is central to the co-design argument, but no microbenchmark or overhead breakdown (e.g., fragmentation, extra indirection cost) is supplied to confirm the mapping preserves sparsity savings across the three algorithms.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a short table summarizing the three techniques and their targeted bottlenecks for quick reference.
  2. [§3.2] Notation for the bucketed LRU policy (e.g., bucket size, eviction threshold) should be defined once in §3.2 and used consistently in later sections and figures.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript describing SPIN. The feedback identifies key areas where additional details and validation would strengthen the presentation of our results. We address each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Evaluation): The central performance claims (1.66-5.66x throughput, 7-9x TTFT, 58% TPOT) rest on end-to-end benchmarks, yet the manuscript provides insufficient detail on experimental setup (models, context lengths, hardware configuration, sparsity patterns, number of trials, and statistical significance). This is load-bearing for the empirical result and must be expanded with tables or appendices showing raw data and controls.

    Authors: We agree with the referee that more comprehensive details on the experimental setup are necessary to fully substantiate the performance claims. In the revised manuscript, we will significantly expand §4 to include specific information on the models evaluated, the range of context lengths, the hardware configuration (including GPU and host memory specifications), the sparsity patterns employed by each algorithm, the number of experimental trials, and statistical measures such as variance across runs. We will also add appendices containing raw data tables and additional controls to facilitate reproducibility and rigorous validation of the reported throughput, TTFT, and TPOT improvements. revision: yes

  2. Referee: [§3.1] §3.1 (Unified Partition Abstraction): The claim that the page-based substrate successfully maps differing sparsity granularities without offsetting overheads is central to the co-design argument, but no microbenchmark or overhead breakdown (e.g., fragmentation, extra indirection cost) is supplied to confirm the mapping preserves sparsity savings across the three algorithms.

    Authors: We recognize that providing microbenchmarks would offer direct evidence that the unified partition abstraction does not introduce significant overheads that offset the sparsity benefits. Although our end-to-end evaluations across three algorithms support the overall efficacy, we will incorporate microbenchmark results in the revised manuscript. These will quantify potential overheads including fragmentation, indirection costs, and PCIe transfer efficiencies for each sparsity granularity, demonstrating that the page-based substrate preserves the intended savings. revision: yes
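
One concrete shape the promised microbenchmark could take is to time a page-indexed gather against a contiguous read of the same volume and report the ratio as the cost of the extra indirection; fragmentation would need a separate allocator-level measurement. The snippet below is a generic PyTorch sketch under those assumptions, not the authors' benchmark.

```python
import torch


def indirection_overhead(num_pages=4096, page_size=16, head_dim=128, selected=256):
    """Ratio of page-indexed gather time to an equally sized contiguous read."""
    kv = torch.randn(num_pages, page_size, head_dim, device="cuda")
    idx = torch.randperm(num_pages, device="cuda")[:selected]

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)

    start.record()
    _ = kv.index_select(0, idx)      # indirect, page-granular access
    stop.record()
    torch.cuda.synchronize()
    indirect_ms = start.elapsed_time(stop)

    start.record()
    _ = kv[:selected].clone()        # contiguous access of the same volume
    stop.record()
    torch.cuda.synchronize()
    contiguous_ms = start.elapsed_time(stop)

    return indirect_ms / contiguous_ms
```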

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical systems framework (SPIN) built on vLLM that implements three co-design techniques for sparse attention and hierarchical KV storage. All central claims consist of measured end-to-end throughput, TTFT, and TPOT improvements obtained from concrete implementations and benchmarks against vLLM and prior sparse kernels. No equations, fitted parameters, self-definitional mappings, or load-bearing self-citations appear in the derivation chain; the results are produced by running the described system rather than by any reduction to prior inputs or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the domain assumption that sparse attention preserves quality while accessing small KV subsets and on the engineering effectiveness of the three introduced techniques; no free parameters or new physical entities are specified.

axioms (1)
  • domain assumption Dynamic sparse attention can maintain model quality while accessing only a small, query-dependent subset of the KV cache
    Invoked in the opening problem statement as the basis for promising relief from KV cache costs.
invented entities (3)
  • unified partition abstraction no independent evidence
    purpose: Maps different sparsity granularities onto a shared page-based KV substrate
    Introduced as the first co-design technique.
  • locality-aware KV cache manager no independent evidence
    purpose: Dynamically sizes per-request HBM budgets and applies GPU-friendly bucketed LRU to reduce PCIe transfers
    Introduced as the second co-design technique.
  • two-level hierarchical metadata layout no independent evidence
    purpose: Sized to the active working set rather than worst-case address space
    Introduced as the third co-design technique.

pith-pipeline@v0.9.0 · 5595 in / 1574 out tokens · 89086 ms · 2026-05-07T13:09:04.284605+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 41 canonical work pages · 11 internal anchors

  1. [1]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM In- ference with Sarathi-Serve. In18th USENIX Symposium on Operat- ing Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 117–134.https://www.usenix...

  2. [2]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Ef- ficient LLM Inference by Piggybacking Decodes with Chunked Prefills. CoRRabs/2308.16369 (2023). doi:10.48550/ARXIV.2308.16369

  3. [3]

    ai-dynamo. 2026. AIPerf: A comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution.https://github.com/ai-dynamo/aiperf. GitHub repository, accessed 2026-03-29

  4. [4]

    Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C. del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Co...

  5. [5]

    Anthropic. 2025. Claude.https://www.anthropic.com/claude. Ac- cessed: 2025-08-01

  6. [6]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhid- ian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Paper...

  7. [7]

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks.arXiv preprint arXiv:2412.15204(2024)

  8. [8]

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubra- manyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning.Proceedings of the ACM on Software Engineering1, FSE (2024), 675–698. doi:10.1145/3643757

  9. [9]

    Laszlo A. Belady. 1966. A study of replacement algorithms for a virtual-storage computer.IBM Systems journal5, 2 (1966), 78–101

  10. [10]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM In- ference Acceleration Framework with Multiple Decoding Heads. arXiv:2401.10774 [cs.LG]https://arxiv.org/abs/2401.10774

  11. [11]

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling.CoRR abs/2302.01318 (2023). doi:10.48550/ARXIV.2302.01318

  12. [12]

    Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. 2024. ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction.https://openreview.net/forum?id= 4oAt5L4lYe&referrer=%5Bthe%20profile%20of%20Renze%20Chen% 5D(%2Fprofile%3Fid%3D~Renze_Chen1)

  13. [13]

    Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, and Mao Yang. 2025. RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference. arXiv:2505.02922 [cs] doi:10.48550/arXiv.2505.02922

  14. [14]

    Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. 2025. MagicPIG: LSH Sampling for Efficient LLM Generation. InThe Thirteenth International Confer- ence on Learning Representations. OpenReview.net, Singapore.https: //openreview.net/forum?id=ALzTQUgW8a

  15. [15]

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers.CoRR abs/1904.10509 (2019).http://arxiv.org/abs/1904.10509

  16. [16]

    Adaptively Sparse Transformers

    Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. InProceedings of the 2019 Confer- ence on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Associ- ation for Computational Linguistics, Hong Kong, China, 2174–2184. doi:10.18653/V1/D19-1223

  17. [17]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christo- pher Ré. 2022. FlashAttention: Fast and Memory-Efficient Ex- act Attention with IO-Awareness. InThe Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Or- leans, LA, USA.http://papers.nips.cc/paper_files/paper/2022/hash/ 67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Confe...

  18. [18]

    Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. 2024. How Sparse Attention Approximates Exact Attention? Your Attention is Naturally𝑛 𝐶-Sparse.arXiv preprint arXiv:2404.02690(2024). 12

  19. [19]

    Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, et al. 2025. Seerattention- r: Sparse attention adaptation for long reasoning.arXiv preprint arXiv:2506.08889(2025)

  20. [20]

    Google. 2025. Gemini.https://gemini.google.com/app. Accessed: 2025-08-01

  21. [21]

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

  22. [22]

    KVQuant: Towards 10 Million Context Length LLM Infer- ence with KV Cache Quantization. InThe Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada.http://papers.nips.cc/paper_files/paper/2024/hash/ 028fcbcf85435d39a40c4d61b42c99a4-Abstract-Conference.html

  23. [23]

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. DeepSpeed Ulysses: System Optimizations for Enabling Training of Ex- treme Long Sequence Transformer Models. arXiv:2309.14509 [cs.LG] https://arxiv.org/abs/2309.14509

  24. [24]

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dong- sheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MIn- ference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. InThe Thirty-Eighth Annual Con- ference on Neural Information Processing Systems. Vancouv...

  25. [25]

    Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan O Arik. 2024. Long- context llms meet rag: Overcoming challenges for long inputs in rag. arXiv preprint arXiv:2410.05983(2024)

  26. [27]

    Gonzalez, Hao Zhang, and Ion Stoica

    Efficient Memory Management for Large Language Model Serving with PagedAttention. In29th Symposium on Operating Systems Principles(Koblenz Germany, 2023-10-23). ACM, 611–626. doi:10.1145/ 3600006.3613165

  27. [28]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  28. [29]

    Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles. ACM, Koblenz, Germany, 611–626. doi:10.1145/3600006.3613165

  29. [30]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, 155–172.https://www.usenix.org/conference/osdi24/ presentation/lee

  30. [31]

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Dem- ing Chen. 2024. SnapKV: LLM Knows What You are Look- ing for Before Generation. InThe Thirty-Eighth Annual Con- ference on Neural Information Processing Systems. Vancouver, BC, Canada.http://papers.nips.cc/paper_files/paper/2024/hash/ 2...

  31. [32]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv:2503.01840 [cs.CL]https://arxiv.org/ abs/2503.01840

  32. [33]

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2025. EA- GLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv:2401.15077 [cs.LG]https://arxiv.org/abs/2401.15077

  33. [34]

    Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhi- gang Ji, Tao Xie, Yong Li, and Wei Lin. 2024. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache. arXiv:2401.02669 [cs.DC]https://arxiv.org/abs/2401.02669

  34. [35]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Accel- eration. InProceedings of the Seventh Annual Conference on Machine Learning and Systems. mlsys.org, Santa Clara, CA, USA.https://procee...

  35. [36]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556(2025)

  36. [37]

    Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu. 2024. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.CoRR abs/2409.10516 (2024). doi:10.48550/ARXIV.2409.10516

  37. [38]

    Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring At- tention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889 [cs.CL]https://arxiv.org/abs/2310.01889

  38. [39]

    Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. 2024. LongGen- Bench: Long-context Generation Benchmark. arXiv:2410.04199 [cs.CL] https://arxiv.org/abs/2410.04199

  39. [40]

    Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaot- ing Feng, Yuyang Huang, Samuel Shen, Rui Zhang, Kuntai Du, and Junchen Jiang. 2025. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. arXiv:2510.09665 [cs.LG]https: //arxiv.org/abs/2510.09665

  40. [41]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. InForty- first International Conference on Machine Learning. OpenReview.net, Vienna, Austria.https://openreview.net/forum?id=L057s2Rq8O

  41. [42]

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, and Aixin Sun. 2024. MMLongBench-Doc: Benchmarking Long-context Docu- ment Understanding with Visualizations. arXiv:2407.01523 [cs.CV] https://arxiv.org/abs/2407.01523

  42. [43]

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1(New York, NY, USA, 2025-02-06)(ASPLOS ...

  43. [44]

    Meta. 2024. Llama-3.1-70B.https://huggingface.co/meta-llama/Llama- 3.1-70B. Accessed: 2024-09-25

  44. [45]

    Meta. 2025. The Llama 4 herd: The beginning of a new era of na- tively multimodal AI innovation.https://ai.meta.com/blog/llama-4- multimodal-intelligence/. Accessed: 2025-04-05

  45. [46]

    OpenAI. 2025. ChatGPT.https://chat.chatbotapp.ai/. Accessed: 2025-08-01

  46. [47]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (2024-06). 118–132. doi:10.1109/ISCA59077.2024.00019

  47. [48]

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shiv- ani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Trans- former Inference. InProceedings of the Sixth Conference on 13 Zhao et al. Machine Learning and Systems. mlsys.org, Miami, FL, USA. https://proceedings.mlsys.org/paper_files/paper/2023...

  48. [49]

    Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar. 2025. vAttention: Dynamic Memory Manage- ment for Serving LLMs without PagedAttention. InProceedings of the 30th ACM International Conference on Architectural Support for Pro- gramming Languages and Operating Systems, Volume 1(New York, NY, USA, 2025-03-30)(ASPLOS ’25). Ass...

  49. [50]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Ar- chitecture for Serving LLM Chatbot. In23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 155–170.https://www.useni...

  50. [51]

    Qwen. 2025. Qwen3-14B.https://huggingface.co/Qwen/Qwen3-14B. Accessed: 2026-04-07

  51. [52]

    Qwen. 2025. Qwen3-32B.https://huggingface.co/Qwen/Qwen3-32B. Accessed: 2026-04-07

  52. [53]

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, and Douglas Orr. 2024. SparQ Attention: Bandwidth- Efficient LLM Inference. InForty-first International Conference on Ma- chine Learning. OpenReview.net, Vienna, Austria.https://openreview. net/forum?id=OS5dqxmmtl

  53. [54]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wen- han Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Tou- vron, Louis Martin, ...

  54. [55]

    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. InFortieth Interna- tional Conference on Machine Learning (Proceedings of Machine Learn- ing Research, Vol. 202). PMLR, Honolul...

  55. [56]

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Sys- tems Principles. ACM, Austin, TX, USA, 590–606. doi:10.1145/3694715. 3695964

  56. [57]

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In18th USENIX Symposium on Operating Sys- tems Design and Implementation (OSDI 24)(Santa Clara, CA, 2024-07). USENIX Association, 173–191.https://www.usenix.org/conference/ osdi24/presentation/sun-biao

  57. [58]

    Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2025. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. https://openreview.net/forum?id=oa7MYAO6h6

  58. [59]

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference. InForty-First International Conference on Machine Learning

  59. [60]

    Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530 [cs.CL]https: //arxiv.org/abs/2403.05530

  60. [61]

    Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  61. [62]

    Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, and Mi Zhang. 2024. D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models.CoRRabs/2406.13035 (2024). doi:10.48550/ ARXIV.2406.13035

  62. [63]

    Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. 2023. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv:2401.00908 [cs.CL]https: //arxiv.org/abs/2401.00908

  63. [64]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. InThe Thirty-Sixth Annual Con- ference on Neural Information Processing Systems. New Orleans, LA, USA.http://papers.nips.cc/paper_files/paper/2022/hash/ 9d5609613...

  64. [65]

    Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. ACM, Austin, TX, USA, 640–654. doi:10.1145/3694715.3695948

  65. [66]

    Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Kun Fu, Zheng Wang, and Hui Xiong. 2024. TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dy- namic Token-Level KV Cache Selection.CoRRabs/2411.02886 (2024). doi:10.48550/ARXIV.2411.02886

  66. [67]

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2025. DuoAttention: Ef- ficient Long-Context LLM Inference with Retrieval and Streaming Heads. InThe Thirteenth International Conference on Learning Rep- resentations. OpenReview.net, Singapore.https://openreview.net/ forum?id=cFu7ze7xUm

  67. [68]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. InThe Twelfth International Conference on Learning Represen- tations. OpenReview.net, Vienna, Austria.https://openreview.net/ forum?id=NG7sS51zVF

  68. [69]

    Fangyuan Xu, Tanya Goyal, and Eunsol Choi. 2024. Recycled At- tention: Efficient inference for long-context language models.CoRR abs/2411.05787 (2024). doi:10.48550/ARXIV.2411.05787

  69. [70]

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. 2025. XAttention: Block Sparse Attention with Antidiagonal Scoring. InForty-second International Conference on Machine Learning. OpenReview.net, Vancouver, BC, Canada.https://openreview.net/ forum?id=KG6aBfGi6e

  70. [71]

    Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, and Haibo Chen. 2024. PowerInfer-2: Fast Large Language Model Inference on a Smartphone.CoRRabs/2406.06282 (2024). doi:10.48550/ARXIV.2406. 06282

  71. [72]

    Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, and Song Han

  72. [73]

    Language Model Cascades: Token-Level Uncertainty and Beyond

    LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention.CoRRabs/2502.14866 (2025). doi:10.48550/ARXIV. 2502.14866

  73. [74]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Carlsbad, CA, USA, 521–538.https://www.usenix.org/conference/ osdi22/presentation/yu

  74. [75]

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and 14 Wangding Zeng. 2025. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. InProceedings of the 63rd An- nual Meeting of the Association for Co...

  75. [76]

    Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Ji- dong Zhai, Joseph Gonzalez, and Ion Stoica. 2025. Jenga: Effective Memory Management for Serving LLM with Heterogeneity. InPro- ceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP ’25). Association fo...

  76. [77]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Ka- lika Bali (Eds.). Associati...

  77. [78]

    Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, and Anshumali Shrivastava

  78. [79]

    KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization. InThe Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancou- ver, BC, Canada.http://papers.nips.cc/paper_files/paper/2024/hash/ 05d6b5b6901fb57d2c287e1d3ce6d63c-Abstract-Conference.html

  79. [80]

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lian- min Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christo- pher Ré, Clark W. Barrett, Zhangyang Wang, and Beidi Chen

  80. [81]

    H2O: Heavy-Hitter Oracle for Efficient Generative Infer- ence of Large Language Models. InThe Thirty-Seventh Annual Conference on Neural Information Processing Systems. New Or- leans, LA, USA.http://papers.nips.cc/paper_files/paper/2023/hash/ 6ceefa7b15572587b78ecfcebb2827f8-Abstract-Conference.html

Showing first 80 references.