Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
Pith reviewed 2026-05-07 13:09 UTC · model grok-4.3
The pith
SPIN unifies different sparse attention methods under one hierarchical KV memory system to realize their promised speedups in long-context LLM serving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPIN is a sparse-attention-aware inference framework built on vLLM that co-designs the execution pipeline with hierarchical KV storage. It introduces a unified partition abstraction that maps differing sparsity granularities onto a shared page-based KV substrate; a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and employs a GPU-friendly bucketed LRU policy to reduce PCIe round-trips; and a two-level hierarchical metadata layout sized to the active working set. Across three representative sparse attention algorithms, the resulting system reports 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than plain vLLM, while cutting TPOT by up to 58% relative to the original sparse-attention implementations.
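The bucketed LRU idea can be pictured with a small sketch. Nothing below comes from the paper; the class and method names are hypothetical, illustrating only the general technique: pages are grouped into coarse recency buckets, so aging and eviction operate on whole buckets rather than on a per-page linked list, which maps poorly onto GPU-side bookkeeping.

```python
from collections import deque

class BucketedLRU:
    """Approximate LRU over KV pages. Pages live in coarse recency
    buckets; bucket 0 is hottest, the last bucket is coldest.
    Hypothetical sketch, not SPIN's actual data structure."""

    def __init__(self, num_buckets=4):
        self.buckets = deque(set() for _ in range(num_buckets))

    def touch(self, page_id):
        # A page accessed this step moves to the hottest bucket.
        for bucket in self.buckets:
            bucket.discard(page_id)
        self.buckets[0].add(page_id)

    def age(self):
        # Periodic aging: drop the coldest bucket wholesale and open a
        # fresh hottest bucket. The returned pages are eviction
        # candidates the caller may spill to CPU memory.
        spilled = self.buckets.pop()
        self.buckets.appendleft(set())
        return spilled
```

Because `age()` returns an entire cold bucket at once, eviction decisions are batched, which is the property that makes the policy GPU-friendly compared with pointer-chasing exact LRU.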
What carries the argument
The unified partition abstraction that maps varying sparsity granularities onto a shared page-based KV substrate, together with the locality-aware bucketed LRU manager that sizes per-request HBM budgets.
If this is right
- Sparse attention algorithms no longer require separate system-level implementations to achieve end-to-end gains.
- Hierarchical GPU-CPU KV storage becomes practical without the irregular transfers erasing sparsity benefits.
- Per-request HBM budgets can be adjusted dynamically while still preserving locality across decoding steps.
- Metadata overhead stays proportional to the active working set rather than the full context length.
- The same framework supports multiple representative sparse methods without per-algorithm tuning.
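The first of these points can be illustrated with a sketch of what a unified partition abstraction might look like: two different sparsity granularities (per-token and per-block selection) both reduce to one page-set representation over a shared substrate. `PAGE_SIZE`, the function names, and the constants are illustrative assumptions, not the paper's API.

```python
PAGE_SIZE = 16  # tokens per KV page (illustrative)

def tokens_to_pages(token_ids):
    """Map a fine-grained, token-level sparse selection onto the shared
    page substrate: the set of pages that must be resident in HBM."""
    return sorted({t // PAGE_SIZE for t in token_ids})

def blocks_to_pages(block_ids, block_size=64):
    """Map a coarse, block-level sparse selection onto the same
    substrate by expanding each block into the pages it covers."""
    pages = set()
    for b in block_ids:
        start, end = b * block_size, (b + 1) * block_size
        pages.update(range(start // PAGE_SIZE,
                           (end + PAGE_SIZE - 1) // PAGE_SIZE))
    return sorted(pages)
```

Since both granularities collapse to a sorted page set, a single cache manager and a single transfer engine can serve either algorithm without per-algorithm plumbing.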
Where Pith is reading between the lines
- The same partition and bucketed-LRU ideas could be tested on attention patterns that mix local and global tokens to see whether the locality signal remains strong enough.
- Extending the page substrate to include slower tiers such as NVMe would test how far the reduction in round-trips generalizes when latency gaps widen.
- If the unified abstraction proves stable, it opens a route to compile-time rewriting of new sparse kernels directly onto the page layout instead of hand-coded kernels.
Load-bearing premise
That different sparse attention patterns can be expressed as partitions over the same page-based KV substrate, and that the bucketed LRU policy will reduce PCIe transfers enough to outweigh any added management cost.
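This premise can be made concrete with back-of-envelope arithmetic. All model shapes below are illustrative assumptions, not figures from the paper: per-page KV bytes follow from the model configuration, and PCIe traffic per decode step scales with the HBM miss rate that the bucketed LRU policy is meant to drive down.

```python
def kv_bytes_per_page(page_tokens=16, layers=32, kv_heads=8,
                      head_dim=128, dtype_bytes=2):
    """Bytes of K and V state stored in one KV page for an
    illustrative model shape (factor 2 covers K plus V)."""
    return page_tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

def pcie_bytes_per_step(pages_needed, hit_rate):
    """PCIe bytes for one decode step: only HBM misses cross the
    GPU-CPU boundary. Hypothetical accounting, not measured data."""
    misses = pages_needed * (1 - hit_rate)
    return misses * kv_bytes_per_page()
```

With these numbers a page is 2 MiB, so a request touching 64 pages per step moves about 128 MiB over PCIe at a 0% hit rate but only about 13 MiB at 90%, which is the kind of gap the locality-aware manager would need to open for the premise to hold.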
What would settle it
Running the same three sparse attention algorithms on identical long-context workloads and observing that total PCIe bytes transferred or end-to-end latency do not decrease relative to the original per-algorithm implementations would falsify the central performance claim.
Original abstract
Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV state per decoding step and extending the KV storage to CPU memory. In practice, however, these algorithmic savings rarely translate into end-to-end system-level gains because sparse methods typically operate at different granularities and thus rely on ad hoc, per-algorithm implementations. At the same time, hierarchical KV storage introduces a new systems bottleneck: retrieving fine-grained, irregular KV subsets across the GPU-CPU boundary can easily erase the benefits of sparsity. We present SPIN, a sparse-attention-aware inference framework that co-designs the execution pipeline with hierarchical KV storage through three techniques: (1) a unified partition abstraction that maps different sparsity granularities onto a shared page-based KV substrate; (2) a locality-aware KV cache manager that dynamically sizes per-request HBM budgets and uses a GPU-friendly bucketed LRU policy to cut PCIe round-trips; and (3) a two-level hierarchical metadata layout sized to the active working set rather than the worst-case address space. Built on vLLM with three representative sparse attention algorithms, SPIN delivers 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, and reduces TPOT by up to 58% over the original sparse-attention implementations.
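The third technique, metadata sized to the active working set, can be sketched as a two-level lookup: a small map from logical page id to a slot in a dense metadata array, so metadata memory grows with the number of resident pages rather than with the worst-case address space. This is a hypothetical layout, not SPIN's actual structure.

```python
class TwoLevelPageTable:
    """Working-set-sized metadata: level 1 is a sparse map from
    logical page id to a dense slot; level 2 is a compact array of
    per-page records. Hypothetical sketch."""

    def __init__(self):
        self.slot_of = {}  # level 1: logical page id -> dense slot
        self.meta = []     # level 2: records for resident pages only

    def insert(self, page_id, record):
        # Re-inserting a resident page updates its record in place.
        if page_id in self.slot_of:
            self.meta[self.slot_of[page_id]] = record
        else:
            self.slot_of[page_id] = len(self.meta)
            self.meta.append(record)

    def lookup(self, page_id):
        slot = self.slot_of.get(page_id)
        return None if slot is None else self.meta[slot]

    def resident_pages(self):
        return len(self.meta)
```

The point of the two levels is that a request with a million-token context but only a few dozen hot pages pays metadata for the few dozen, not the million.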
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SPIN, a sparse-attention-aware inference framework that co-designs the LLM serving pipeline with hierarchical KV storage via three techniques: (1) a unified partition abstraction mapping varying sparsity granularities to a shared page-based KV substrate, (2) a locality-aware KV cache manager using dynamic HBM budgeting and GPU-friendly bucketed LRU to reduce PCIe round-trips, and (3) a two-level hierarchical metadata layout sized to the active working set. Built atop vLLM and evaluated with three representative sparse attention algorithms, SPIN reports 1.66-5.66x higher end-to-end throughput and 7-9x lower TTFT than vLLM, plus up to 58% TPOT reduction versus the original sparse kernels.
Significance. If the empirical gains hold under rigorous validation, the work is significant for scalable long-context LLM serving: it directly tackles the granularity mismatch between sparse attention and hierarchical memory plus the PCIe retrieval bottleneck. The co-design of the three techniques (unified partition, bucketed LRU, and working-set metadata) is a concrete strength that could be adopted by production serving systems.
Major comments (2)
- [§4] §4 (Experimental Evaluation): The central performance claims (1.66-5.66x throughput, 7-9x TTFT, 58% TPOT) rest on end-to-end benchmarks, yet the manuscript provides insufficient detail on experimental setup (models, context lengths, hardware configuration, sparsity patterns, number of trials, and statistical significance). This is load-bearing for the empirical result and must be expanded with tables or appendices showing raw data and controls.
- [§3.1] §3.1 (Unified Partition Abstraction): The claim that the page-based substrate successfully maps differing sparsity granularities without offsetting overheads is central to the co-design argument, but no microbenchmark or overhead breakdown (e.g., fragmentation, extra indirection cost) is supplied to confirm the mapping preserves sparsity savings across the three algorithms.
Minor comments (2)
- [Abstract] The abstract and introduction would benefit from a short table summarizing the three techniques and their targeted bottlenecks for quick reference.
- [§3.2] Notation for the bucketed LRU policy (e.g., bucket size, eviction threshold) should be defined once in §3.2 and used consistently in later sections and figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript describing SPIN. The feedback identifies key areas where additional details and validation would strengthen the presentation of our results. We address each major comment below.
Point-by-point responses
Referee: [§4] §4 (Experimental Evaluation): The central performance claims (1.66-5.66x throughput, 7-9x TTFT, 58% TPOT) rest on end-to-end benchmarks, yet the manuscript provides insufficient detail on experimental setup (models, context lengths, hardware configuration, sparsity patterns, number of trials, and statistical significance). This is load-bearing for the empirical result and must be expanded with tables or appendices showing raw data and controls.
Authors: We agree with the referee that more comprehensive details on the experimental setup are necessary to fully substantiate the performance claims. In the revised manuscript, we will significantly expand §4 to include specific information on the models evaluated, the range of context lengths, the hardware configuration (including GPU and host memory specifications), the sparsity patterns employed by each algorithm, the number of experimental trials, and statistical measures such as variance across runs. We will also add appendices containing raw data tables and additional controls to facilitate reproducibility and rigorous validation of the reported throughput, TTFT, and TPOT improvements.
Revision: yes
Referee: [§3.1] §3.1 (Unified Partition Abstraction): The claim that the page-based substrate successfully maps differing sparsity granularities without offsetting overheads is central to the co-design argument, but no microbenchmark or overhead breakdown (e.g., fragmentation, extra indirection cost) is supplied to confirm the mapping preserves sparsity savings across the three algorithms.
Authors: We recognize that providing microbenchmarks would offer direct evidence that the unified partition abstraction does not introduce significant overheads that offset the sparsity benefits. Although our end-to-end evaluations across three algorithms support the overall efficacy, we will incorporate microbenchmark results in the revised manuscript. These will quantify potential overheads including fragmentation, indirection costs, and PCIe transfer efficiencies for each sparsity granularity, demonstrating that the page-based substrate preserves the intended savings.
Revision: yes
Circularity Check
No significant circularity
Full rationale
The paper describes an empirical systems framework (SPIN) built on vLLM that implements three co-design techniques for sparse attention and hierarchical KV storage. All central claims consist of measured end-to-end throughput, TTFT, and TPOT improvements obtained from concrete implementations and benchmarks against vLLM and prior sparse kernels. No equations, fitted parameters, self-definitional mappings, or load-bearing self-citations appear in the derivation chain; the results are produced by running the described system rather than by any reduction to prior inputs or definitions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: dynamic sparse attention can maintain model quality while accessing only a small, query-dependent subset of the KV cache.
Invented entities (3)
- unified partition abstraction: no independent evidence
- locality-aware KV cache manager: no independent evidence
- two-level hierarchical metadata layout: no independent evidence