arxiv: 2505.02922 · v3 · submitted 2025-05-05 · 💻 cs.LG

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Pith reviewed 2026-05-22 15:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords long-context LLM inference · KV cache · sparse attention · vector index · GPU-CPU memory management · attention approximation · inference acceleration · retrieval engine

The pith

RetroInfer retrieves only the most relevant KV cache tokens from CPU memory using a wave index to deliver up to 4.4X higher decoding throughput at 120K contexts while matching full attention accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-context LLM inference slows because the KV cache grows linearly and requires scanning every entry for attention at each step. RetroInfer offloads the full KV cache to CPU memory and retrieves a small important subset using a vector index designed specifically for attention patterns. The wave index applies tripartite attention approximation, accuracy-bound estimation, and segmented clustering to keep retrieval costs low without hurting output quality. A wave buffer coordinates data movement and computation across GPU and CPU hardware. Tests across models and tasks show large speedups over both dense and prior sparse baselines at context lengths up to one million tokens.
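To make the linear growth concrete, here is a back-of-the-envelope estimate. It is a sketch under stated assumptions, not a figure from the paper: the dimensions are those of a Llama-3-8B-style model (32 layers, 8 KV heads, head dimension 128, 16-bit values), "K" is taken as 1024 tokens, and the function name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens tokens.

    Defaults are ASSUMED Llama-3-8B-style dimensions with fp16 storage.
    """
    # The leading 2 accounts for storing both a key and a value per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3
for label, tokens in [("120K", 120 * 1024), ("1M", 1024 * 1024)]:
    print(f"{label}: {kv_cache_bytes(tokens) / GIB:.1f} GiB")
# 120K: 15.0 GiB
# 1M: 128.0 GiB
```

Under these assumptions a single 1M-token request needs more KV storage than typical single-GPU memory, which is the pressure that motivates keeping the cache in CPU memory and fetching only a subset per step.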

Core claim

By building an Attention-aWare VEctor index that combines tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering, RetroInfer creates a practical sparsity-based KV cache; when paired with the wave buffer for heterogeneous memory management, this yields up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse baselines at 1M tokens while preserving full-attention accuracy.

What carries the argument

The wave index, an Attention-aWare VEctor index that uses tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to reduce retrieval cost while bounding accuracy loss in sparse KV cache access.
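One way to picture the three-part split is sketched below. This is an illustrative reading, not the paper's implementation. The following are all assumptions made for the sketch: a "steady" set of tokens that is always attended exactly, plain k-means run independently inside fixed-length segments, a single shared logit per non-retrieved cluster taken from its centroid, and every function and parameter name. The sketch uses only the Python standard library and a toy problem size.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def full_attention(q, keys, values):
    """Reference: exact softmax attention over every token."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def segmented_clusters(keys, token_ids, segment_len, per_segment, iters=5):
    """Run a few k-means steps inside each fixed-length segment of token_ids.
    Returns a list of clusters, each a list of token ids. Requires iters >= 1."""
    clusters = []
    for start in range(0, len(token_ids), segment_len):
        seg = token_ids[start:start + segment_len]
        k = min(per_segment, len(seg))
        centroids = [keys[seg[i * len(seg) // k]] for i in range(k)]  # evenly spaced init
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for t in seg:
                nearest = min(range(k), key=lambda c: sqdist(keys[t], centroids[c]))
                groups[nearest].append(t)
            centroids = [mean_vec([keys[t] for t in g]) if g else centroids[c]
                         for c, g in enumerate(groups)]
        clusters.extend(g for g in groups if g)
    return clusters

def tripartite_attention(q, keys, values, clusters, steady_ids, top_r):
    """Exact attention on steady tokens and the top_r clusters (ranked by
    query-centroid score); centroid-based estimate for all other clusters."""
    scale = 1.0 / math.sqrt(len(q))
    centroids = [mean_vec([keys[t] for t in c]) for c in clusters]
    order = sorted(range(len(clusters)),
                   key=lambda c: dot(q, centroids[c]), reverse=True)
    retrieved, estimated = order[:top_r], order[top_r:]

    exact_ids = list(steady_ids) + [t for c in retrieved for t in clusters[c]]
    exact_logits = [dot(q, keys[t]) * scale for t in exact_ids]
    # ASSUMPTION: every token in an estimated cluster shares the centroid's logit.
    est_logits = [dot(q, centroids[c]) * scale for c in estimated]
    top = max(exact_logits + est_logits)

    dim = len(values[0])
    num, den = [0.0] * dim, 0.0
    for t, l in zip(exact_ids, exact_logits):
        w = math.exp(l - top)
        den += w
        for j in range(dim):
            num[j] += w * values[t][j]
    for c, l in zip(estimated, est_logits):
        w = math.exp(l - top)
        den += w * len(clusters[c])          # one weight per token in the cluster
        for j in range(dim):
            num[j] += w * sum(values[t][j] for t in clusters[c])
    return [x / den for x in num]

random.seed(0)
d, n = 8, 64
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(d)]
steady = list(range(4)) + list(range(n - 4, n))   # first and last few tokens
middle = list(range(4, n - 4))
clusters = segmented_clusters(keys, middle, segment_len=28, per_segment=4)
approx = tripartite_attention(q, keys, values, clusters, steady, top_r=2)
exact = full_attention(q, keys, values)
print("max abs error vs full attention:", max(abs(a - b) for a, b in zip(approx, exact)))
```

In this sketch, raising `top_r` moves clusters from the estimated set to the exactly computed set, so the output converges to full attention once every cluster is retrieved. That is the accuracy-versus-retrieval-cost dial the page describes, in miniature.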

If this is right

  • Decoding throughput rises by up to 4.4X versus full attention when context reaches 120K tokens.
  • Speedups reach 12.2X over earlier sparse attention methods once context hits 1 million tokens.
  • Accuracy stays equivalent to full attention across tested models and workloads.
  • GPU memory and bandwidth demands drop enough to support contexts of at least 1 million tokens on existing hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval logic could extend to other memory-bound stages such as feed-forward layers in very long sequences.
  • Lower bandwidth use from sparse access may reduce total power draw when serving many long-context requests.
  • Buffer management between CPU and GPU could be reused in other inference systems that mix fast and slow memory tiers.
  • Dynamic adjustment of the wave index during generation might further improve accuracy on tasks with shifting attention patterns.

Load-bearing premise

Attention sparsity patterns across models and workloads can be captured well enough by tripartite approximation and segmented clustering to avoid accuracy loss that would require per-model or per-task tuning.

What would settle it

Running RetroInfer on a new model or task and finding that generated outputs deviate measurably from full-attention outputs at the same context length would show the sparsity approximation is insufficient.

Figures

Figures reproduced from arXiv: 2505.02922 by Bailu Ding, Baotong Lu, Chen Chen, Cheng Li, Chengruidong Zhang, Di Liu, Fan Yang, Huiqiang Jiang, Jiawei Jiang, Jingjia Luo, Jing Liu, Jinkai Zhang, Mao Yang, Mingxing Zhang, Qianxi Zhang, Qi Chen, Xiao Yan, Yaoqi Chen, Yuqing Yang.

Figure 1: RetroInfer raises the trade-off ceiling between accuracy and retrieval cost, and manages data across hardware.
Figure 3: Dynamic sparsity in attention (Llama3-8B-1048K, ...
Figure 4: Variety of attention sparsity across (a) model layers ...
Figure 6: Attention-aware design of wave index. The wave ...
Figure 7: Illustration of computation cost for three zones. The ...
Figure 8: Centroid representativeness and estimation accu...
Figure 9: Design of wave buffer. Black arrows denote pointer ...
Figure 10: RULER accuracy under different context lengths ...
Figure 12: RULER accuracy (128K)
Figure 14: Maximum decoding throughput across tasks and ...
Figure 15: Prefilling latency under different context lengths. RetroInfer's prefilling latency is only slightly higher than full attention due to lightweight index building.
Figure 18: Impact of three zone sizes on maximum decoding throughput and task accuracy (Llama3.1-8B, 128K context).
Figure 19: (a) Estimation helps improve the model accuracy. ...
Original abstract

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens -- all while preserving full-attention-level accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RetroInfer, a vector storage engine for efficient long-context LLM inference by exploiting attention sparsity in the KV cache. It proposes the Attention-aWare VEctor index (wave index) using tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to retrieve important tokens, along with a wave buffer for GPU-CPU data management. Evaluations across models and workloads report up to 4.4X decoding throughput over full attention at 120K context and 12.2X over sparse attention baselines at 1M tokens while preserving full-attention-level accuracy.

Significance. If the empirical claims hold, RetroInfer could meaningfully advance practical long-context LLM deployment by reducing memory bandwidth pressure through a specialized vector index and buffer manager. The introduction of attention-specific approximations (tripartite, accuracy-bound estimation, segmented clustering) and the wave buffer represents a concrete systems contribution. The reported speedups are substantial and would be of high practical interest if shown to be robust and reproducible without hidden accuracy costs.

major comments (2)
  1. §4 (Evaluation): The manuscript claims preservation of full-attention-level accuracy and reports concrete speedups, yet provides no details on experimental controls, number of runs, statistical significance of throughput numbers, or how accuracy was measured across all tokens, layers, and heads. This directly undermines confidence in the central claim that the tripartite approximation and segmented clustering avoid degradation.
  2. §3.2 (Tripartite attention approximation and accuracy-bound estimation): The description of how the accuracy-bound estimation and segmented clustering reliably capture sparsity patterns across diverse models, layers, and generation steps lacks formal bounds or ablation evidence showing that token importance is not systematically under- or over-estimated for long-range dependencies. This is load-bearing for the 'full-attention-level accuracy' guarantee.
minor comments (2)
  1. Abstract and §1: The acronym 'wave index' is used before its expansion and a brief description of its three components; adding a short parenthetical on first use would improve readability.
  2. Figures in §4: Ensure all plots include error bars or variance indicators and explicit legends distinguishing full attention, sparse baselines, and RetroInfer across context lengths.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our work. We address each of the major comments below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: §4 (Evaluation): The manuscript claims preservation of full-attention-level accuracy and reports concrete speedups, yet provides no details on experimental controls, number of runs, statistical significance of throughput numbers, or how accuracy was measured across all tokens, layers, and heads. This directly undermines confidence in the central claim that the tripartite approximation and segmented clustering avoid degradation.

    Authors: We agree that the experimental details in the current manuscript are insufficient to fully substantiate the accuracy and performance claims. In the revised version, we will add a dedicated 'Experimental Methodology' subsection to §4. This subsection will specify the full experimental controls (hardware configuration with exact GPU/CPU models and memory sizes, software stack, and workload generation procedures), the number of independent runs performed (5 runs per configuration using different random seeds, with results reported as mean ± standard deviation), statistical significance testing (paired t-tests on throughput measurements with p-values), and the precise accuracy evaluation protocol. Accuracy is measured via (i) end-to-end perplexity and token-level match rate against full-attention outputs on standard benchmarks and (ii) layer- and head-wise comparison of approximated attention scores to full attention scores across all context tokens.

    revision: yes
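The reporting protocol promised in this response can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The function names are hypothetical, and the throughput numbers are made-up placeholders chosen to exercise the code, not measurements from the paper.

```python
import math
import statistics

def mean_std(samples):
    """Mean and sample standard deviation, as in 'mean ± standard deviation'."""
    return statistics.mean(samples), statistics.stdev(samples)

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i]; degrees of freedom = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def token_match_rate(reference_ids, candidate_ids):
    """Fraction of positions where two token-id sequences agree.
    Positions beyond the shorter sequence count as mismatches."""
    longest = max(len(reference_ids), len(candidate_ids))
    if longest == 0:
        return 1.0
    same = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return same / longest

# PLACEHOLDER data: five paired runs (same seed per pair), arbitrary units.
system_a = [10.0, 12.0, 11.0, 13.0, 12.0]
system_b = [8.0, 9.0, 9.0, 10.0, 9.0]

mean_a, std_a = mean_std(system_a)
print(f"system A: {mean_a:.2f} ± {std_a:.2f}")
print(f"paired t statistic: {paired_t_statistic(system_a, system_b):.3f}")
print(f"match rate: {token_match_rate([1, 2, 3, 4], [1, 2, 0, 4]):.2f}")
```

Turning the t statistic into a p-value requires the Student's t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call if SciPy is available.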

  2. Referee: §3.2 (Tripartite attention approximation and accuracy-bound estimation): The description of how the accuracy-bound estimation and segmented clustering reliably capture sparsity patterns across diverse models, layers, and generation steps lacks formal bounds or ablation evidence showing that token importance is not systematically under- or over-estimated for long-range dependencies. This is load-bearing for the 'full-attention-level accuracy' guarantee.

    Authors: We acknowledge that stronger theoretical and empirical grounding would increase confidence in the approximation techniques. The manuscript already contains multi-model, multi-layer evaluations at long contexts that empirically support preserved accuracy, but we agree these do not constitute formal bounds or targeted long-range ablations. In the revision we will extend §3.2 with a short derivation of an error bound for the tripartite approximation (showing the per-token estimation error is upper-bounded by a term linear in the attention sparsity ratio) and add a new ablation subsection in §4 that isolates long-range dependency retrieval (tokens >50k positions) across generation steps and layers, reporting both importance-score correlation with full attention and any observed systematic bias.

    revision: yes
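The long-range ablation described in this response could report its two quantities along the following lines. This is a hedged sketch: `long_range_report`, its parameters, and the toy weights are hypothetical, and reading "systematic bias" as the mean signed difference between approximate and exact attention weights is an assumption rather than the authors' stated definition.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def long_range_report(exact, approx, query_pos, min_distance):
    """Compare attention weights on tokens more than min_distance before query_pos.

    exact[i] and approx[i] are the full-attention and approximated weights
    that the query at position query_pos assigns to token i.
    """
    ids = [i for i in range(len(exact)) if query_pos - i > min_distance]
    e = [exact[i] for i in ids]
    a = [approx[i] for i in ids]
    # Mean signed error: positive means the approximation over-estimates.
    bias = sum(y - x for x, y in zip(e, a)) / len(ids)
    return {"tokens": len(ids), "correlation": pearson(e, a), "mean_bias": bias}

# Toy weights for illustration only.
exact_w = [0.10, 0.20, 0.30, 0.05, 0.35]
approx_w = [0.20, 0.40, 0.60, 0.05, 0.35]
report = long_range_report(exact_w, approx_w, query_pos=5, min_distance=2)
print(report)
```

A correlation near 1 with a non-zero `mean_bias` would indicate that the ranking of distant tokens is preserved while their weights are shifted, which is the kind of systematic over- or under-estimation the referee asks about.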

Circularity Check

0 steps flagged

No significant circularity; claims rest on novel components and empirical measurements

Full rationale

The paper presents RetroInfer as a new vector storage engine with a wave index built from tripartite attention approximation, accuracy-bound estimation, and segmented clustering, plus a wave buffer for heterogeneous memory management. These are introduced as original designs, with performance claims (4.4X throughput at 120K, 12.2X at 1M tokens) backed by direct experimental comparisons to full attention and sparse baselines while reporting preserved accuracy. No equations or sections reduce a claimed prediction or result to a fitted parameter or self-citation by construction; the derivation chain consists of system architecture choices evaluated externally rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new system components whose performance depends on internal design choices whose sensitivity is not quantified in the abstract.

invented entities (2)
  • wave index (no independent evidence)
    purpose: Attention-aware vector index for efficient KV cache retrieval
    Core new data structure presented as the main technical contribution.
  • wave buffer (no independent evidence)
    purpose: GPU-CPU buffer manager for heterogeneous memory
    New buffer design for managing data movement across hardware.
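To show what a GPU-CPU buffer manager has to do at minimum, here is a toy block cache. It is not the wave buffer's design: the least-recently-used eviction policy, the class name `BlockCache`, and the use of a Python dict to stand in for the CPU-resident KV store are all assumptions chosen to keep the sketch short.

```python
from collections import OrderedDict

class BlockCache:
    """Fixed-capacity cache of KV blocks standing in for GPU memory.

    Misses are served from cpu_store, a mapping that stands in for the
    full KV cache held in CPU memory. Eviction is LRU (an ASSUMPTION).
    """

    def __init__(self, cpu_store, capacity):
        self.cpu_store = cpu_store
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)     # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.cpu_store[block_id]           # stands in for a CPU-to-GPU copy
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)       # evict least recently used
        return data

store = {i: f"block-{i}" for i in range(4)}
cache = BlockCache(store, capacity=2)
for block_id in [0, 1, 0, 2, 1]:
    cache.get(block_id)
print(cache.hits, cache.misses, list(cache.blocks))
# 1 4 [2, 1]
```

Every miss in this model corresponds to a transfer over the CPU-GPU link, so the hit rate is what determines how much of the retrieval cost stays off that link.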

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordinati...

  2. AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

    cs.DC · 2026-05 · unverdicted · novelty 6.0

    AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...

  3. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · cited by 3 Pith papers · 11 internal anchors

  1. [1]

    01-ai. 2024. Yi-6B-200K. https://huggingface.co/01-ai/Yi-6B-200K. Accessed: 2024-11-11

  2. [2]

    01-ai. 2024. Yi-9B-200K. https://huggingface.co/01-ai/Yi-9B-200K. Accessed: 2024-11-11

  3. [3]

    Gulavani, Alexey Tumanov, and Ramachandran Ramjee

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwa- tra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 117–134. https://www.usenix....

  4. [4]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.CoRRabs/2308.16369 (2023). https://doi.org/10.48550/ARXIV.2308.16369

  5. [5]

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 4895–4901. https:...

  6. [6]

    doi: 10.18653/v1/2024.acl-long

    Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C. del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for C...

  7. [7]

    Anthropic. 2025. Claude. https://www.anthropic.com/claude. Accessed: 2025- 08-01

  8. [8]

    C., Arun Iyer, Suresh Parthasarathy, Sriram K

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning.Proceedings of the ACM on Software Engineering1, FSE (2024), 675–

  9. [9]

    https://doi.org/10.1145/3643757

  10. [10]

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. CoRRabs/2406.02069 (2024). https://doi.org/10.48550/ARXIV.2406.02069

  11. [11]

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling.CoRRabs/2302.01318 (2023). https: //doi.org/10.48550/ARXIV.2302.01318

  12. [12]

    Cheng Chen, Chenzhe Jin, Yunan Zhang, Sasha Podolsky, Chun Wu, Szu- Po Wang, Eric Hanson, Zhou Sun, Robert Walzer, and Jianguo Wang. 2024. SingleStore-V: An Integrated Vector Database System in SingleStore.Proc. VLDB Endow.17, 12 (2024), 3772–3785. https://doi.org/10.14778/3685800.3685805

  13. [13]

    Sean Wang

    Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. 2024. RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search.Proc. VLDB Endow.17, 11 (2024), 2735–2749. https: //doi.org/10.14778/3681954.3681959

  14. [14]

    Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025. IMPRESS: An Importance- Informed Multi-Tier Prefix KV Storage System for Large Language Model Infer- ence. In23rd USENIX Conference on File and Storage Technologies. USENIX As- sociation, Santa Clara, CA, USA, 187–201. https://www.use...

  15. [15]

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2024. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs.CoRRabs/2412.21187 (2024). https: //doi.org/10.48550/ARXIV.2412.21187

  16. [16]

    Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. 2025. MagicPIG: LSH Sampling for Efficient LLM Generation. In The Thirteenth International Conference on Learning Representations. OpenRe- view.net, Singapore. https://openreview.net/forum?id=ALzTQUgW8a

  17. [17]

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers.CoRRabs/1904.10509 (2019). http: //arxiv.org/abs/1904.10509

  18. [18]

    and Niculae, Vlad and Martins, André F.T

    Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2174–2184. https://doi.org/10.18653...

  19. [19]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InThe Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Orleans, LA, USA. http://papers.nips.cc/paper_files/paper/2022/hash/ 67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html

  20. [20]

    DeepSeek. 2025. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Accessed: 2025-08-01

  21. [21]

    DeepSeek. 2025. DeepSeek-R1-Distill-Qwen-7B. https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Accessed: 2025-08-01

  22. [22]

    DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.CoRRabs/2501.12948 (2025). https://doi.org/10. 48550/ARXIV.2501.12948

  23. [23]

    Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. 2024. How Sparse Attention Approximates Exact Attention? Your Attention is Naturally 𝑛𝐶 - Sparse.arXiv preprint arXiv:2404.02690(2024)

  24. [24]

    Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, Kyriakos Mouratidis, Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, and Bo Tang. 2025. AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. InCompanion of the 2025 International Conference on Manag...

  25. [26]

    Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph.Proc. VLDB Endow.12, 5 (2019), 461–474. https://doi.org/10.14778/3303753.3303754

  26. [27]

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless In- ference for Large Language Models. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 135–153. https://www.usenix.org/conference/osdi24/presentation/fu

  27. [28]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. InProceedings of the 2024 USENIX Annual Technical Conference. USENIX Asso- ciation, Santa Clara, CA, USA, 111–126. https://www.usenix.org/...

  28. [29]

    Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast State Restoration in LLM Serving with HCache. InProceedings of the Twentieth European Conference on Computer Systems. ACM, Rotterdam, The Netherlands, 128–143. https: //doi.org/10.1145/3689031.3696072

  29. [30]

    Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adap- tive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serv- ing.Proceedings of the ACM on Management of Data3, 3 (2025), 130:1–130:28. https://doi.org/10.1145/3725394

  30. [31]

    Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. 2024. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.CoRRabs/2410.13276 (2024). https://doi.org/10.48550/ ARXIV.2410.13276

  31. [32]

    Google. 2025. Gemini. https://gemini.google.com/app. Accessed: 2025-08-01

  32. [33]

    gradientai. 2024. Llama-3-8B-Instruct-Gradient-1048k. https://huggingface.co/ gradientai/Llama-3-8B-Instruct-Gradient-1048k. Accessed: 2024-10-29

  33. [34]

    Greg Kamradt. 2023. Needle in a haystack - pressure testing llms. https: //github.com/gkamradt/LLMTest_NeedleInAHaystack. Accessed: 2024-08-12

  34. [35]

    Rentong Guo, Xiaofan Luan, Long Xiang, Xiao Yan, Xiaomeng Yi, Jigao Luo, Qianya Cheng, Weizhi Xu, Jiarui Luo, Frank Liu, Zhenshan Cao, Yanliang Qiao, Ting Wang, Bo Tang, and Charles Xie. 2022. Manu: A Cloud Native Vector Database Management System.Proc. VLDB Endow.15, 12 (2022), 3548–3561. https://doi.org/10.14778/3554821.3554843

  35. [36]

    Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada. http://papers.nips.cc/paper_files/paper/202...

  36. [37]

    Mahoney, Kurt Keutzer, and Amir Gholami

    Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Mon- ishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. 2025. Squeezed Attention: Accelerating Long Context Length LLM Inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)....

  37. [38]

    Kurt Hornik, Ingo Feinerer, Martin Kober, and Christian Buchta. 2012. Spherical k-means clustering.Journal of statistical software50 (2012), 1–22

  38. [39]

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?CoRRabs/2404.06654 (2024). https://doi.org/10.48550/ARXIV.2404.06654

  39. [40]

    Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: To- wards Removing the Curse of Dimensionality. InProceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing. ACM, Dallas, Texas, USA, 604–613. https://doi.org/10.1145/276698.276876

  40. [41]

    InfiniGen. 2024. InfiniGen Code. https://github.com/snu-comparch/InfiniGen. Accessed: 2025-04-01

  41. [42]

    Johan Ludwig William Valdemar Jensen. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes.Acta mathematica30, 1 (1906), 175–193

  42. [43]

    Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre- filling for Long-Context LLMs via Dynamic Sparse Attention. InThe Thirty- Eighth Annual Conference on Neural Information Processing Systems. Van- couver...

  43. [44]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles. ACM, Koblenz, Germany, 611–626. https://doi.org/10.1145/3600006.3613165

  44. [45]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, 155–172. https: //www.usenix.org/conference/osdi24/presentation/lee

  45. [46]

    Viktor Leis. 2024. LeanStore: A High-Performance Storage Engine for NVMe SSDs.Proc. VLDB Endow.17, 12 (2024), 4536–4545. https://doi.org/10.14778/ 3685800.3685915

  46. [47]

    Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Dietrich. 2023. Virtual-Memory Assisted Buffer Management.Proceedings of the ACM on Management of Data1, 1 (2023), 7:1–7:25. https://doi.org/10.1145/ 3588687

  47. [48]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Genera- tion for Knowledge-Intensive NLP Tasks. InThe Thirty-fourth Annual Conference on Neural Information Processing Systems. virtual. ht...

  48. [49]

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. InThe Thirty-Seventh Annual ...

  49. [50]

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. InThe Thirty- Eighth Annual Conference on Neural Information Processing Systems. Van- couver, BC, Canada. http://papers.nips.cc/paper_files/paper/2024/hash/ 28a...

  50. [51]

    Gonzalez, and Ion Stoica

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica

  51. [52]

    In17th USENIX Symposium on Operating Systems Design and Implementation

    AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In17th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 663–679. https://www.usenix.org/ conference/osdi23/presentation/li-zhouhan

  52. [53]

    Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 929–945. https: //www.usenix.org/conference/osdi24/presentation/lin-chaofan

  53. [54]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems. mlsys.org, Santa Clara, CA, USA. https://proc...

  54. [55]

    Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu. 2024. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. CoRR abs/2409.10516 (2024). https://doi.org/10.48550/ARXIV.2409.10516

  55. [56]

    Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. 2025. ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression. In 62nd ACM/IEEE Design Automation Conference. IEEE, San Francisco, CA, USA, 1–7. https://doi.org/10.1109/DAC63849.2025.11132479

  56. [57]

    Shige Liu, Zhifang Zeng, Li Chen, Adil Ainihaer, Arun Ramasami, Songting Chen, Yu Xu, Mingxi Wu, and Jianguo Wang. 2025. TigerVector: Supporting Vector Search in Graph Databases for Advanced RAGs. In Companion of the 2025 International Conference on Management of Data. ACM, Berlin, Germany, 553–565. https://doi.org/10.1145/3722212.3724456

  57. [58]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In Forty-first International Conference on Machine Learning. OpenReview.net, Vienna, Austria. https://openreview.net/forum?id=L057s2Rq8O

  58. [59]

    Kejing Lu, Mineichi Kudo, Chuan Xiao, and Yoshiharu Ishikawa. 2021. HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approximate Nearest Neighbor Search. Proc. VLDB Endow. 15, 2 (2021), 246–258. https://doi.org/10.14778/3489496.3489506

  59. [60]

    MagicPIG. 2024. MagicPIG Code. https://github.com/Infini-AI-Lab/MagicPIG. Accessed: 2025-04-01

  60. [61]

    Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2020), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473

  61. [62]

    Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. ACM, Rotterdam, The Netherlands, 58...

  62. [63]

    Meta. 2024. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2024-09-25

  63. [64]

    Meta. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 2025-04-05

  64. [65]

    Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas. 2023. High-Throughput Vector Similarity Search in Knowledge Graphs. Proceedings of the ACM on Management of Data 1, 2 (2023), 197:1–197:25. https://doi.org/10.1145/3589777

  65. [66]

    Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In The Sixth International Conference on Learning Representations. OpenReview.net, Vancouver, BC, Canada. https://openreview.net/forum?id=HkuGJ3kCb

  66. [67]

    Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. 2023. Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions. In The Eleventh International Conference on Learning Representations. OpenReview.net, Kigali, Rwanda. https://openreview.net/forum?id=4D4TSJE6-K

  67. [68]

    NVIDIA. 2020. NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/. Accessed: 2025-04-01

  68. [69]

    NVIDIA. 2020. NVIDIA RTX A6000 Graphics Card. https://www.nvidia.com/en-us/products/workstations/rtx-a6000/. Accessed: 2025-10-01

  69. [70]

    Art of Problem Solving. 2024. AIME Problems and Solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Accessed: 2025-08-01

  70. [71]

    Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2024. CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs. In 40th IEEE International Conference on Data Engineering. IEEE, Utrecht, The Netherlands, 4236–4247. https://doi.org/10.1109/ICDE60146.2024.00323

  71. [72]

    OpenAI. 2025. ChatGPT. https://chat.chatbotapp.ai/. Accessed: 2025-08-01

  72. [73]

    James Jie Pan, Jianguo Wang, and Guoliang Li. 2024. Survey of vector database management systems. VLDB J. 33, 5 (2024), 1591–1615. https://doi.org/10.1007/S00778-024-00864-X

  73. [74]

    Liana Patel, Peter Kraft, Carlos Guestrin, and Matei Zaharia. 2024. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2, 3 (2024), 120. https://doi.org/10.1145/3654923

  74. [75]

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of the Sixth Conference on Machine Learning and Systems. mlsys.org, Miami, FL, USA. https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html

  76. [77]

    PQCache. 2024. PQCache. https://github.com/HugoZHL/PQCache. Accessed: 2025-04-01

  77. [78]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 155–170. https://www.usenix...

  78. [79]

    Quest. 2024. Quest Code. https://github.com/mit-han-lab/Quest. Accessed: 2025-04-01

  79. [80]

    Qwen. 2024. Qwen2.5-72B-Instruct. https://huggingface.co/Qwen/Qwen2.5-72B-Instruct. Accessed: 2025-01-12

  80. [81]

    Qwen. 2024. Qwen2.5-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct. Accessed: 2025-01-12

Showing first 80 references.