RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
Pith reviewed 2026-05-22 15:53 UTC · model grok-4.3
The pith
RetroInfer retrieves only the most relevant KV cache tokens from CPU memory using a wave index to deliver up to 4.4X higher decoding throughput at 120K contexts while matching full attention accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RetroInfer builds an Attention-aWare VEctor index (the wave index) that combines tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to make a sparsity-based KV cache practical. Paired with the wave buffer for heterogeneous memory management, it yields up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse baselines at 1M tokens while preserving full-attention accuracy.
What carries the argument
The wave index, an Attention-aWare VEctor index that uses tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to reduce retrieval cost while bounding accuracy loss in sparse KV cache access.
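To make the three-way split concrete, here is a minimal sketch of how a tripartite approximation could be computed for a single query. The zone names (steady, retrieval, estimation) come from the paper passage quoted further down this page; what each zone does is an assumption for illustration: steady tokens are always attended exactly, the clusters whose centroids score highest against the query are fetched and attended exactly, and every remaining cluster is approximated by its centroid, its size, and the sum of its values. All function and variable names are hypothetical, the clusters are hand-picked rather than learned, and the query is assumed to be pre-scaled.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    # Arithmetic mean of a list of equal-length vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def full_attention(q, keys, values):
    # Reference: softmax(q . k) weighted sum over every token.
    weights = [math.exp(dot(q, k)) for k in keys]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tripartite_attention(q, keys, values, steady, clusters, n_retrieve):
    # steady: token ids attended exactly (assumed steady zone).
    # clusters: lists of token ids; the n_retrieve best-scoring clusters are
    # attended exactly (assumed retrieval zone), the rest are approximated
    # from their centroid (assumed estimation zone).
    dim = len(values[0])
    centroids = [mean_vec([keys[i] for i in c]) for c in clusters]
    ranked = sorted(range(len(clusters)),
                    key=lambda j: dot(q, centroids[j]), reverse=True)
    num, den = [0.0] * dim, 0.0
    exact_ids = list(steady)
    for j in ranked[:n_retrieve]:
        exact_ids.extend(clusters[j])
    for i in exact_ids:
        w = math.exp(dot(q, keys[i]))
        den += w
        for d in range(dim):
            num[d] += w * values[i][d]
    for j in ranked[n_retrieve:]:
        # One exponential per cluster stands in for every member's weight.
        w = math.exp(dot(q, centroids[j]))
        den += w * len(clusters[j])
        for d in range(dim):
            num[d] += w * sum(values[i][d] for i in clusters[j])
    return [x / den for x in num]

# Synthetic toy data, not taken from the paper.
rng = random.Random(0)
keys = [[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(16)]
values = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
q = [rng.gauss(0, 0.5) for _ in range(4)]
steady = [0, 1, 14, 15]  # e.g. first and last tokens
clusters = [[2, 3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13]]
exact = full_attention(q, keys, values)
all_retrieved = tripartite_attention(q, keys, values, steady, clusters, 3)
one_retrieved = tripartite_attention(q, keys, values, steady, clusters, 1)
```

When every cluster is retrieved the estimation zone is empty and the result coincides with full attention up to floating-point rounding; lowering `n_retrieve` trades accuracy for fewer exact key and value reads. Because the exponential is convex, the centroid term `len(cluster) * exp(q . centroid)` never exceeds the true sum of member weights (Jensen's inequality), so in this sketch the estimation zone can only understate its share of the softmax denominator.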
If this is right
- Decoding throughput rises by up to 4.4X versus full attention when context reaches 120K tokens.
- Speedups over earlier sparse attention methods reach up to 12.2X at 1 million tokens of context.
- Accuracy stays equivalent to full attention across tested models and workloads.
- GPU memory and bandwidth demands drop enough to support contexts of at least 1 million tokens on existing hardware.
Where Pith is reading between the lines
- The same retrieval logic could extend to other memory-bound stages such as feed-forward layers in very long sequences.
- Lower bandwidth use from sparse access may reduce total power draw when serving many long-context requests.
- Buffer management between CPU and GPU could be reused in other inference systems that mix fast and slow memory tiers.
- Dynamic adjustment of the wave index during generation might further improve accuracy on tasks with shifting attention patterns.
Load-bearing premise
Attention sparsity patterns across models and workloads can be captured well enough by tripartite approximation and segmented clustering to avoid accuracy loss that would require per-model or per-task tuning.
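This page names segmented clustering without describing it. The sketch below assumes it means cutting the key sequence into fixed-size contiguous segments and clustering each segment on its own, so that no cluster spans two segments. Spherical k-means (cosine similarity to unit-length centroids) is chosen here purely for illustration; the segment size, cluster count, iteration count, and all names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # leave an all-zero vector unchanged
    return [x / norm for x in v]

def spherical_kmeans(vectors, k, iters=10, seed=0):
    # Returns one cluster label per vector.
    rng = random.Random(seed)
    unit = [normalize(v) for v in vectors]
    centroids = [list(v) for v in rng.sample(unit, k)]
    labels = [0] * len(unit)
    for _ in range(iters):
        labels = [max(range(k), key=lambda j: dot(v, centroids[j]))
                  for v in unit]
        for j in range(k):
            members = [v for v, lab in zip(unit, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = normalize([sum(c) for c in zip(*members)])
    return labels

def segmented_clustering(keys, segment_size, clusters_per_segment):
    # Cluster each contiguous segment independently and return clusters
    # as lists of global token ids.
    clusters = []
    for start in range(0, len(keys), segment_size):
        segment = keys[start:start + segment_size]
        k = min(clusters_per_segment, len(segment))
        labels = spherical_kmeans(segment, k, seed=start)
        groups = {}
        for offset, label in enumerate(labels):
            groups.setdefault(label, []).append(start + offset)
        clusters.extend(groups.values())
    return clusters

# Synthetic keys, not taken from the paper.
rng = random.Random(0)
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
clusters = segmented_clustering(keys, segment_size=16, clusters_per_segment=4)
```

One plausible motivation for this design is cost: each k-means pass compares every key with every centroid, so restricting the comparison to centroids in the same segment keeps the work per key constant as the context grows. Whether that restriction loses accuracy for some models or tasks is exactly what the premise above leaves open.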
What would settle it
Running RetroInfer on a new model or task and finding that generated outputs deviate measurably from full-attention outputs at the same context length would show the sparsity approximation is insufficient.
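A minimal sketch of such a check, assuming both systems decode greedily from the same prompt so that their outputs are deterministic token-id sequences. The function names and the toy sequences are hypothetical.

```python
def token_match_rate(reference_ids, candidate_ids):
    # Fraction of positions where the two sequences agree; positions past
    # the end of the shorter sequence count as mismatches.
    longest = max(len(reference_ids), len(candidate_ids))
    if longest == 0:
        return 1.0
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches / longest

def first_divergence(reference_ids, candidate_ids):
    # Index of the first differing position, or None if identical.
    for pos, (r, c) in enumerate(zip(reference_ids, candidate_ids)):
        if r != c:
            return pos
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None

# Toy token ids, not real model output.
full_attention_ids = [101, 7, 42, 42, 9, 3]
sparse_ids = [101, 7, 42, 13, 9, 3]
rate = token_match_rate(full_attention_ids, sparse_ids)
where = first_divergence(full_attention_ids, sparse_ids)
```

The first divergence position is worth reporting alongside the match rate: once two generations differ at one token, every later token is conditioned on a different prefix, so mismatches after that point do not measure the approximation on its own.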
Abstract
Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens -- all while preserving full-attention-level accuracy.
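The linear growth of the KV cache is easy to put into numbers. The sketch below assumes a model shape chosen for illustration only (32 layers, 8 key-value heads, head dimension 128, 16-bit values, roughly an 8B-parameter grouped-query model); none of these figures is taken from the abstract.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value):
    # Factor 2: one key vector and one value vector per token, layer, and
    # key-value head.
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * num_tokens)

# Assumed model shape, for illustration only.
ASSUMED = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **ASSUMED)
for n in (120_000, 1_000_000):
    gib = kv_cache_bytes(n, **ASSUMED) / 2**30
    print(f"{n:>9,} tokens -> {gib:6.1f} GiB")
```

Under these assumptions each token costs 128 KiB, so a single 120K-token request needs about 14.6 GiB of cache and a 1M-token request about 122 GiB, more than the memory of one current GPU. Every decoding step of full attention also has to read that whole cache, which is where the bandwidth pressure comes from.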
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RetroInfer, a vector storage engine for efficient long-context LLM inference by exploiting attention sparsity in the KV cache. It proposes the Attention-aWare VEctor index (wave index) using tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to retrieve important tokens, along with a wave buffer for GPU-CPU data management. Evaluations across models and workloads report up to 4.4X decoding throughput over full attention at 120K context and 12.2X over sparse attention baselines at 1M tokens while preserving full-attention-level accuracy.
Significance. If the empirical claims hold, RetroInfer could meaningfully advance practical long-context LLM deployment by reducing memory bandwidth pressure through a specialized vector index and buffer manager. The introduction of attention-specific approximations (tripartite, accuracy-bound estimation, segmented clustering) and the wave buffer represents a concrete systems contribution. The reported speedups are substantial and would be of high practical interest if shown to be robust and reproducible without hidden accuracy costs.
Major comments (2)
- §4 (Evaluation): The manuscript claims preservation of full-attention-level accuracy and reports concrete speedups, yet provides no details on experimental controls, number of runs, statistical significance of throughput numbers, or how accuracy was measured across all tokens, layers, and heads. This directly undermines confidence in the central claim that the tripartite approximation and segmented clustering avoid degradation.
- §3.2 (Tripartite attention approximation and accuracy-bound estimation): The description of how the accuracy-bound estimation and segmented clustering reliably capture sparsity patterns across diverse models, layers, and generation steps lacks formal bounds or ablation evidence showing that token importance is not systematically under- or over-estimated for long-range dependencies. This is load-bearing for the 'full-attention-level accuracy' guarantee.
Minor comments (2)
- Abstract and §1: The term 'wave index' is used before its expansion (Attention-aWare VEctor index) is given and before its three components are described; adding a short parenthetical on first use would improve readability.
- Figures in §4: Ensure all plots include error bars or variance indicators and explicit legends distinguishing full attention, sparse baselines, and RetroInfer across context lengths.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our work. We address each of the major comments below and outline the revisions we will make to the manuscript.
- Referee: §4 (Evaluation): The manuscript claims preservation of full-attention-level accuracy and reports concrete speedups, yet provides no details on experimental controls, number of runs, statistical significance of throughput numbers, or how accuracy was measured across all tokens, layers, and heads. This directly undermines confidence in the central claim that the tripartite approximation and segmented clustering avoid degradation.
Authors: We agree that the experimental details in the current manuscript are insufficient to fully substantiate the accuracy and performance claims. In the revised version, we will add a dedicated 'Experimental Methodology' subsection to §4. This subsection will specify the full experimental controls (hardware configuration with exact GPU/CPU models and memory sizes, software stack, and workload generation procedures), the number of independent runs performed (5 runs per configuration using different random seeds, with results reported as mean ± standard deviation), statistical significance testing (paired t-tests on throughput measurements with p-values), and the precise accuracy evaluation protocol. Accuracy is measured via (i) end-to-end perplexity and token-level match rate against full-attention outputs on standard benchmarks and (ii) layer- and head-wise comparison of approximated attention scores to full attention scores across all context tokens. revision: yes
- Referee: §3.2 (Tripartite attention approximation and accuracy-bound estimation): The description of how the accuracy-bound estimation and segmented clustering reliably capture sparsity patterns across diverse models, layers, and generation steps lacks formal bounds or ablation evidence showing that token importance is not systematically under- or over-estimated for long-range dependencies. This is load-bearing for the 'full-attention-level accuracy' guarantee.
Authors: We acknowledge that stronger theoretical and empirical grounding would increase confidence in the approximation techniques. The manuscript already contains multi-model, multi-layer evaluations at long contexts that empirically support preserved accuracy, but we agree these do not constitute formal bounds or targeted long-range ablations. In the revision we will extend §3.2 with a short derivation of an error bound for the tripartite approximation (showing the per-token estimation error is upper-bounded by a term linear in the attention sparsity ratio) and add a new ablation subsection in §4 that isolates long-range dependency retrieval (tokens >50k positions) across generation steps and layers, reporting both importance-score correlation with full attention and any observed systematic bias. revision: yes
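The reporting protocol promised in the first response (several runs per configuration, mean ± standard deviation, paired t-tests on throughput) can be sketched with the Python standard library alone. The throughput values below are placeholders invented for illustration and are not measurements from the paper; the function names are hypothetical.

```python
import math
import statistics

def mean_and_std(samples):
    # Sample mean and sample standard deviation (n - 1 denominator).
    return statistics.mean(samples), statistics.stdev(samples)

def paired_t_statistic(a, b):
    # t statistic and degrees of freedom for paired samples a[i], b[i],
    # e.g. two systems measured with the same seed and workload.
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# PLACEHOLDER tokens-per-second values, one entry per seed. Not real data.
baseline_tps = [100.0, 102.0, 98.0, 101.0, 99.0]
candidate_tps = [150.0, 155.0, 149.0, 152.0, 151.0]

base_mean, base_std = mean_and_std(baseline_tps)
cand_mean, cand_std = mean_and_std(candidate_tps)
t_stat, dof = paired_t_statistic(candidate_tps, baseline_tps)
print(f"baseline  {base_mean:.1f} ± {base_std:.1f} tok/s")
print(f"candidate {cand_mean:.1f} ± {cand_std:.1f} tok/s")
print(f"paired t = {t_stat:.1f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value needs the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. Pairing is only meaningful if run i of each system shares the same seed, prompt set, and hardware state, which is a detail the promised methodology subsection would need to state.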
Circularity Check
No significant circularity; claims rest on novel components and empirical measurements
The paper presents RetroInfer as a new vector storage engine with a wave index built from tripartite attention approximation, accuracy-bound estimation, and segmented clustering, plus a wave buffer for heterogeneous memory management. These are introduced as original designs, with performance claims (4.4X throughput at 120K, 12.2X at 1M tokens) backed by direct experimental comparisons to full attention and sparse baselines while reporting preserved accuracy. No equations or sections reduce a claimed prediction or result to a fitted parameter or self-citation by construction; the derivation chain consists of system architecture choices evaluated externally rather than tautological redefinitions.
Axiom & Free-Parameter Ledger
Invented entities (2)
- wave index: no independent evidence
- wave buffer: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "tripartite attention approximation... steady zone, retrieval zone, and estimation zone... accuracy-bound attention estimation... segmented clustering"
- IndisputableMonolith/Foundation/DimensionForcing.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "segmented clustering... 8K segment size... update segment size to 1K tokens"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference. KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordinati...
- AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference. AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving. SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.