RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

arxiv: 2505.02922 · v3 · submitted 2025-05-05 · 💻 cs.LG

Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang

Pith reviewed 2026-05-22 15:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords long-context LLM inference · KV cache · sparse attention · vector index · GPU-CPU memory management · attention approximation · inference acceleration · retrieval engine

The pith

RetroInfer retrieves only the most relevant KV cache tokens from CPU memory using a wave index to deliver up to 4.4X higher decoding throughput at 120K contexts while matching full attention accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-context LLM inference slows because the KV cache grows linearly and requires scanning every entry for attention at each step. RetroInfer offloads the full KV cache to CPU memory and retrieves a small important subset using a vector index designed specifically for attention patterns. The wave index applies tripartite attention approximation, accuracy-bound estimation, and segmented clustering to keep retrieval costs low without hurting output quality. A wave buffer coordinates data movement and computation across GPU and CPU hardware. Tests across models and tasks show large speedups over both dense and prior sparse baselines at context lengths up to one million tokens.
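To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size. The configuration (32 layers, 8 key-value heads, head dimension 128, 16-bit elements) is an assumption chosen to resemble a Llama-3-8B-class model with grouped-query attention; it is not taken from the summary above, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to store keys and values for num_tokens tokens.

    All defaults are assumed values for illustration, not figures from the paper.
    """
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token

for n in (120_000, 1_048_576):
    print(f"{n:>9} tokens -> {kv_cache_bytes(n) / GIB:.1f} GiB")
```

Under these assumed settings each token costs 128 KiB of cache, so a single 1,048,576-token sequence needs 128 GiB. That is the kind of footprint that makes keeping the full cache in CPU memory, and fetching only a subset per step, attractive.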

Core claim

By building an Attention-aWare VEctor index that combines tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering, RetroInfer creates a practical sparsity-based KV cache; when paired with the wave buffer for heterogeneous memory management, this yields up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse baselines at 1M tokens while preserving full-attention accuracy.

What carries the argument

The wave index, an Attention-aWare VEctor index that uses tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to reduce retrieval cost while bounding accuracy loss in sparse KV cache access.
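A minimal NumPy sketch of how a three-zone approximation could work for one query vector. The zone names (steady, retrieval, estimation) come from the paper passage quoted later on this page; everything else is an assumption for illustration. Here the steady zone is taken to be the first `n_sink` and last `n_local` tokens, the retrieval zone is the `n_retrieve` clusters whose centroids score highest against the query, and every remaining cluster is estimated by treating all of its keys as if they sat at the centroid. Function names and parameters are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def full_attention(q, K, V):
    """Exact single-query softmax attention, used as the reference."""
    logits = K @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    return (w @ V) / w.sum()

def tripartite_attention(q, K, V, labels, n_sink=4, n_local=64, n_retrieve=8):
    """Approximate attention with steady, retrieval, and estimation zones.

    labels[i] is a cluster id for token i (assumed to be precomputed).
    """
    n, d = K.shape
    scale = 1.0 / np.sqrt(d)

    # Steady zone (assumption): first and last tokens are always attended exactly.
    steady = np.zeros(n, dtype=bool)
    steady[:n_sink] = True
    steady[max(n - n_local, 0):] = True
    if steady.all():
        return full_attention(q, K, V)

    # Per-cluster membership and centroids over the non-steady tokens.
    ids = np.unique(labels[~steady])
    members = [np.flatnonzero((labels == c) & ~steady) for c in ids]
    centroids = np.stack([K[m].mean(axis=0) for m in members])
    centroid_logits = centroids @ q * scale

    # Retrieval zone: clusters ranked highest by centroid score are read exactly.
    order = np.argsort(-centroid_logits)
    retrieved, estimated = order[:n_retrieve], order[n_retrieve:]
    exact_idx = np.concatenate(
        [np.flatnonzero(steady)] + [members[c] for c in retrieved])
    exact_logits = K[exact_idx] @ q * scale

    shift = max(exact_logits.max(), centroid_logits.max())  # numerical stability
    w_exact = np.exp(exact_logits - shift)
    num = w_exact @ V[exact_idx]
    den = w_exact.sum()

    # Estimation zone: each key in a cluster is scored as if it were the centroid.
    for c in estimated:
        w = np.exp(centroid_logits[c] - shift)
        num += w * V[members[c]].sum(axis=0)
        den += w * len(members[c])
    return num / den

rng = np.random.default_rng(0)
n, d = 2048, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
labels = rng.integers(0, 32, size=n)  # random cluster ids, only to exercise the code
approx = tripartite_attention(q, K, V, labels, n_retrieve=8)
exact = full_attention(q, K, V)
print("max abs error:", np.abs(approx - exact).max())
```

When `n_retrieve` covers every cluster the estimation zone is empty and the result matches exact attention. Shrinking it trades accuracy for fewer keys fetched, which is the accuracy versus retrieval cost trade-off the page describes.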

If this is right

Decoding throughput rises by up to 4.4X versus full attention when context reaches 120K tokens.
Speedups reach 12.2X over earlier sparse attention methods once context hits 1 million tokens.
Accuracy stays equivalent to full attention across tested models and workloads.
GPU memory and bandwidth demands drop enough to support contexts of at least 1 million tokens on existing hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval logic could extend to other memory-bound stages such as feed-forward layers in very long sequences.
Lower bandwidth use from sparse access may reduce total power draw when serving many long-context requests.
Buffer management between CPU and GPU could be reused in other inference systems that mix fast and slow memory tiers.
Dynamic adjustment of the wave index during generation might further improve accuracy on tasks with shifting attention patterns.

Load-bearing premise

Attention sparsity patterns across models and workloads can be captured well enough by tripartite approximation and segmented clustering to avoid accuracy loss that would require per-model or per-task tuning.

What would settle it

Running RetroInfer on a new model or task and finding that generated outputs deviate measurably from full-attention outputs at the same context length would show the sparsity approximation is insufficient.

Figures

Figures reproduced from arXiv: 2505.02922 by Bailu Ding, Baotong Lu, Chen Chen, Cheng Li, Chengruidong Zhang, Di Liu, Fan Yang, Huiqiang Jiang, Jiawei Jiang, Jingjia Luo, Jing Liu, Jinkai Zhang, Mao Yang, Mingxing Zhang, Qianxi Zhang, Qi Chen, Xiao Yan, Yaoqi Chen, Yuqing Yang.

**Figure 1.** RetroInfer raises the trade-off ceiling between accuracy and retrieval cost, and manages data across hardware.

**Figure 3.** Dynamic sparsity in attention (Llama3-8B-1048K, ...

**Figure 4.** Variety of attention sparsity across (a) model layers ...

**Figure 6.** Attention-aware design of wave index. The wave ...

**Figure 7.** Illustration of computation cost for three zones. The ...

**Figure 8.** Centroid representativeness and estimation accu...

**Figure 9.** Design of wave buffer. Black arrows denote pointer ...

**Figure 10.** RULER accuracy under different context lengths

**Figure 12.** RULER accuracy (128K)

**Figure 14.** Maximum decoding throughput across tasks and ...

**Figure 15.** Prefilling latency under different context lengths. RetroInfer’s prefilling latency is only slightly higher than full attention due to lightweight index building.

**Figure 18.** Impact of three zone sizes on maximum decoding throughput and task accuracy (Llama3.1-8B, 128K context).

**Figure 19.** (a) Estimation helps improve the model accuracy.

Original abstract

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens -- all while preserving full-attention-level accuracy.
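One way to picture segmented clustering is to cut the key sequence into fixed-size segments and cluster each segment independently. The sketch below does that with a small spherical k-means written in NumPy. The 8K default segment size echoes the "8K segment size" passage quoted further down this page; `tokens_per_cluster`, the iteration count, the choice of spherical k-means, and all function names are assumptions for illustration rather than the paper's settings.

```python
import numpy as np

def spherical_kmeans(X, k, iters=10, seed=0):
    """Cluster rows of X by cosine similarity; returns one label per row."""
    rng = np.random.default_rng(seed)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmax(Xn @ centroids.T, axis=1)
        for c in range(k):
            members = Xn[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                s = members.sum(axis=0)
                centroids[c] = s / (np.linalg.norm(s) + 1e-12)
    return np.argmax(Xn @ centroids.T, axis=1)

def segmented_clustering(K, segment_size=8192, tokens_per_cluster=16):
    """Cluster each segment on its own; cluster ids are unique across segments."""
    labels = np.empty(len(K), dtype=np.int64)
    next_id = 0
    for start in range(0, len(K), segment_size):
        seg = K[start:start + segment_size]
        k = max(1, len(seg) // tokens_per_cluster)
        labels[start:start + len(seg)] = spherical_kmeans(seg, k) + next_id
        next_id += k
    return labels

rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 32))  # synthetic keys, not real model data
labels = segmented_clustering(keys, segment_size=1024, tokens_per_cluster=16)
print(labels.shape, "clusters used:", len(np.unique(labels)))
```

In this sketch each k-means run only ever compares one segment's keys against that segment's centroids, so the similarity matrix stays bounded by the segment size no matter how long the context grows. Clusters never span a segment boundary, which is the price paid for that bound.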

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RetroInfer's wave index combines tripartite approximation and segmented clustering for sparse KV retrieval, delivering reported speedups that could matter if accuracy generalizes without tuning.

The main takeaway is that RetroInfer introduces a vector storage engine with a wave index to handle sparse KV cache retrieval for long-context LLM inference. The integration of tripartite attention approximation, accuracy-bound estimation, and segmented clustering is what sets the wave index apart from prior sparse attention work. The paper does well in the systems integration, particularly with the wave buffer managing data movement between GPU and CPU. It reports concrete speedups of up to 4.4X decoding throughput over full attention at 120K context and 12.2X over sparse attention baselines at 1 million tokens, while maintaining full-attention-level accuracy on the tested cases.

The softer part is whether those approximations reliably extract the right sparsity patterns across different models, layers, and generation steps without any hidden accuracy loss or need for adjustments. The stress-test concern about potential underestimation of token importance in long-range dependencies is worth watching, and the abstract leaves out specifics on experimental controls and accuracy measurement details. The full paper should clarify if the results are robust or sensitive to the workloads.

This paper targets systems researchers and practitioners focused on optimizing LLM inference for extended contexts. Readers dealing with memory constraints in large-scale deployments would get practical value from the design and performance numbers. I think it deserves a serious referee because the contribution is a working system with measurable improvements on a timely problem, even if more rigorous validation might be requested.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RetroInfer, a vector storage engine for efficient long-context LLM inference by exploiting attention sparsity in the KV cache. It proposes the Attention-aWare VEctor index (wave index) using tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering to retrieve important tokens, along with a wave buffer for GPU-CPU data management. Evaluations across models and workloads report up to 4.4X decoding throughput over full attention at 120K context and 12.2X over sparse attention baselines at 1M tokens while preserving full-attention-level accuracy.

Significance. If the empirical claims hold, RetroInfer could meaningfully advance practical long-context LLM deployment by reducing memory bandwidth pressure through a specialized vector index and buffer manager. The introduction of attention-specific approximations (tripartite, accuracy-bound estimation, segmented clustering) and the wave buffer represents a concrete systems contribution. The reported speedups are substantial and would be of high practical interest if shown to be robust and reproducible without hidden accuracy costs.

major comments (2)

§4 (Evaluation): The manuscript claims preservation of full-attention-level accuracy and reports concrete speedups, yet provides no details on experimental controls, number of runs, statistical significance of throughput numbers, or how accuracy was measured across all tokens, layers, and heads. This directly undermines confidence in the central claim that the tripartite approximation and segmented clustering avoid degradation.
§3.2 (Tripartite attention approximation and accuracy-bound estimation): The description of how the accuracy-bound estimation and segmented clustering reliably capture sparsity patterns across diverse models, layers, and generation steps lacks formal bounds or ablation evidence showing that token importance is not systematically under- or over-estimated for long-range dependencies. This is load-bearing for the 'full-attention-level accuracy' guarantee.

minor comments (2)

Abstract and §1: The acronym 'wave index' is used before its expansion and a brief description of its three components; adding a short parenthetical on first use would improve readability.
Figures in §4: Ensure all plots include error bars or variance indicators and explicit legends distinguishing full attention, sparse baselines, and RetroInfer across context lengths.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our work. We address each of the major comments below and outline the revisions we will make to the manuscript.

Referee: §4 (Evaluation): The manuscript claims preservation of full-attention-level accuracy and reports concrete speedups, yet provides no details on experimental controls, number of runs, statistical significance of throughput numbers, or how accuracy was measured across all tokens, layers, and heads. This directly undermines confidence in the central claim that the tripartite approximation and segmented clustering avoid degradation.

Authors: We agree that the experimental details in the current manuscript are insufficient to fully substantiate the accuracy and performance claims. In the revised version, we will add a dedicated 'Experimental Methodology' subsection to §4. This subsection will specify the full experimental controls (hardware configuration with exact GPU/CPU models and memory sizes, software stack, and workload generation procedures), the number of independent runs performed (5 runs per configuration using different random seeds, with results reported as mean ± standard deviation), statistical significance testing (paired t-tests on throughput measurements with p-values), and the precise accuracy evaluation protocol. Accuracy is measured via (i) end-to-end perplexity and token-level match rate against full-attention outputs on standard benchmarks and (ii) layer- and head-wise comparison of approximated attention scores to full attention scores across all context tokens.

revision: yes

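The reporting protocol described in this response (mean ± standard deviation over repeated runs, a paired t-test on throughput, and a token-level match rate) can be sketched with the Python standard library. The helper names are hypothetical and the throughput lists are made-up placeholder values used only to exercise the functions; they are not measurements from the paper.

```python
import math
import statistics

def mean_std(samples):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(samples), statistics.stdev(samples)

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] measured under matched conditions."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def token_match_rate(reference_ids, candidate_ids):
    """Fraction of reference positions where both sequences emit the same token id."""
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches / max(len(reference_ids), 1)

# Placeholder tokens/second for 5 paired runs (illustrative values only).
baseline_tps = [100.0, 102.0, 98.0, 101.0, 99.0]
candidate_tps = [150.0, 153.0, 149.0, 151.0, 152.0]

m, s = mean_std(candidate_tps)
print(f"candidate: {m:.1f} ± {s:.1f} tokens/s")
print(f"paired t statistic: {paired_t_statistic(candidate_tps, baseline_tps):.2f}")
print("match rate:", token_match_rate([1, 2, 3, 4], [1, 2, 0, 4]))
```

Turning the t statistic into a p-value needs the Student t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` computes both in one call if SciPy is available.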
Referee: §3.2 (Tripartite attention approximation and accuracy-bound estimation): The description of how the accuracy-bound estimation and segmented clustering reliably capture sparsity patterns across diverse models, layers, and generation steps lacks formal bounds or ablation evidence showing that token importance is not systematically under- or over-estimated for long-range dependencies. This is load-bearing for the 'full-attention-level accuracy' guarantee.

Authors: We acknowledge that stronger theoretical and empirical grounding would increase confidence in the approximation techniques. The manuscript already contains multi-model, multi-layer evaluations at long contexts that empirically support preserved accuracy, but we agree these do not constitute formal bounds or targeted long-range ablations. In the revision we will extend §3.2 with a short derivation of an error bound for the tripartite approximation (showing the per-token estimation error is upper-bounded by a term linear in the attention sparsity ratio) and add a new ablation subsection in §4 that isolates long-range dependency retrieval (tokens >50k positions) across generation steps and layers, reporting both importance-score correlation with full attention and any observed systematic bias.

revision: yes
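A long-range ablation of the kind promised here could report agreement along the following lines. This is a sketch under stated assumptions: "tokens >50k positions" is read as tokens more than 50,000 positions behind the current query, agreement is measured with Pearson correlation, and bias is the mean signed difference. The function name is hypothetical and the data is synthetic.

```python
import numpy as np

def long_range_agreement(ref_weights, approx_weights, query_pos, min_distance=50_000):
    """Correlation and mean signed bias over tokens far behind the query.

    ref_weights / approx_weights: attention weights per context token.
    Returns (pearson_correlation, mean(approx - ref)) on the long-range subset.
    """
    positions = np.arange(len(ref_weights))
    mask = (query_pos - positions) > min_distance
    ref, approx = ref_weights[mask], approx_weights[mask]
    corr = np.corrcoef(ref, approx)[0, 1]
    bias = float(np.mean(approx - ref))  # negative means under-estimation
    return corr, bias

rng = np.random.default_rng(0)
n = 60_000
logits = rng.standard_normal(n)
ref = np.exp(logits) / np.exp(logits).sum()            # synthetic reference weights
approx = ref * (1.0 + 0.05 * rng.standard_normal(n))   # synthetic perturbation
corr, bias = long_range_agreement(ref, approx, query_pos=n - 1)
print(f"long-range correlation: {corr:.4f}, bias: {bias:.3e}")
```

A consistently negative bias on the long-range subset would be the signature of the systematic under-estimation the referee asks about, while a high correlation with near-zero bias would support the authors' position.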

Circularity Check

0 steps flagged

No significant circularity; claims rest on novel components and empirical measurements

The paper presents RetroInfer as a new vector storage engine with a wave index built from tripartite attention approximation, accuracy-bound estimation, and segmented clustering, plus a wave buffer for heterogeneous memory management. These are introduced as original designs, with performance claims (4.4X throughput at 120K, 12.2X at 1M tokens) backed by direct experimental comparisons to full attention and sparse baselines while reporting preserved accuracy. No equations or sections reduce a claimed prediction or result to a fitted parameter or self-citation by construction; the derivation chain consists of system architecture choices evaluated externally rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The paper introduces two new system components whose performance depends on internal design choices whose sensitivity is not quantified in the abstract.

invented entities (2)

wave index · no independent evidence
purpose: Attention-aware vector index for efficient KV cache retrieval
Core new data structure presented as the main technical contribution.

wave buffer · no independent evidence
purpose: GPU-CPU buffer manager for heterogeneous memory
New buffer design for managing data movement across hardware.
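To give a feel for what a GPU-CPU buffer manager has to do, the sketch below models a small fast-tier cache of cluster blocks in front of a larger slow-tier store. It is purely illustrative: the least-recently-used replacement policy, the `BlockCache` class, and the dictionary standing in for CPU memory are all assumptions, and nothing on this page states how the wave buffer actually chooses what to keep on the GPU.

```python
from collections import OrderedDict

class BlockCache:
    """Toy fast-tier cache keyed by cluster id with LRU eviction (assumed policy)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store   # stands in for KV blocks held in CPU memory
        self.cache = OrderedDict()     # stands in for blocks resident on the GPU
        self.hits = 0
        self.misses = 0

    def get(self, cluster_id):
        if cluster_id in self.cache:
            self.cache.move_to_end(cluster_id)  # mark as most recently used
            self.hits += 1
            return self.cache[cluster_id]
        self.misses += 1
        block = self.backing[cluster_id]        # stands in for a CPU-to-GPU copy
        self.cache[cluster_id] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least recently used block
        return block

backing = {i: f"block-{i}" for i in range(10)}
cache = BlockCache(capacity=2, backing_store=backing)
for cid in [0, 1, 0, 2, 1]:
    cache.get(cid)
print("hits:", cache.hits, "misses:", cache.misses, "resident:", list(cache.cache))
```

Every miss in this model corresponds to a transfer across the slow link, so the hit rate is a rough proxy for how much bandwidth a cache of this kind would save when consecutive decoding steps retrieve overlapping clusters.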

pith-pipeline@v0.9.0 · 5834 in / 1186 out tokens · 42329 ms · 2026-05-22T15:53:31.506571+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

"tripartite attention approximation... steady zone, retrieval zone, and estimation zone... accuracy-bound attention estimation... segmented clustering"

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear

"segmented clustering... 8K segment size... update segment size to 1K tokens"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
cs.CL · 2026-05 · unverdicted · novelty 6.0

KVDrive introduces a multi-tier KV cache management system that achieves up to 1.74x higher throughput for long-context LLM inference through adaptive cache placement, pipeline restructuring, and cross-tier coordinati...

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
cs.DC · 2026-05 · unverdicted · novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG · 2026-04 · unverdicted · novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · cited by 3 Pith papers · 11 internal anchors

[1] 01-ai. 2024. Yi-6B-200K. https://huggingface.co/01-ai/Yi-6B-200K. Accessed: 2024-11-11

[2] 01-ai. 2024. Yi-9B-200K. https://huggingface.co/01-ai/Yi-9B-200K. Accessed: 2024-11-11

[3] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 117–134. https://www.usenix....

[4] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. CoRR abs/2308.16369 (2023). https://doi.org/10.48550/ARXIV.2308.16369

[5] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 4895–4901. https:...

[6] Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C. del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for C...

[7] Anthropic. 2025. Claude. https://www.anthropic.com/claude. Accessed: 2025-08-01

[8] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning. Proceedings of the ACM on Software Engineering 1, FSE (2024), 675–

[9] https://doi.org/10.1145/3643757

[10] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. CoRR abs/2406.02069 (2024). https://doi.org/10.48550/ARXIV.2406.02069

[11] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. CoRR abs/2302.01318 (2023). https://doi.org/10.48550/ARXIV.2302.01318

[12] Cheng Chen, Chenzhe Jin, Yunan Zhang, Sasha Podolsky, Chun Wu, Szu-Po Wang, Eric Hanson, Zhou Sun, Robert Walzer, and Jianguo Wang. 2024. SingleStore-V: An Integrated Vector Database System in SingleStore. Proc. VLDB Endow. 17, 12 (2024), 3772–3785. https://doi.org/10.14778/3685800.3685805

[13] Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. 2024. RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search. Proc. VLDB Endow. 17, 11 (2024), 2735–2749. https://doi.org/10.14778/3681954.3681959

[14] Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025. IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference. In 23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 187–201. https://www.use...

[15] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2024. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. CoRR abs/2412.21187 (2024). https://doi.org/10.48550/ARXIV.2412.21187

[16] Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. 2025. MagicPIG: LSH Sampling for Efficient LLM Generation. In The Thirteenth International Conference on Learning Representations. OpenReview.net, Singapore. https://openreview.net/forum?id=ALzTQUgW8a

[17] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. CoRR abs/1904.10509 (2019). http://arxiv.org/abs/1904.10509

[18] Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2174–2184. https://doi.org/10.18653... (doi:10.18653/v1/d19-1223)

[19] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In The Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Orleans, LA, USA. http://papers.nips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html

[20] DeepSeek. 2025. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Accessed: 2025-08-01

[21] DeepSeek. 2025. DeepSeek-R1-Distill-Qwen-7B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Accessed: 2025-08-01

[22] DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. CoRR abs/2501.12948 (2025). https://doi.org/10.48550/ARXIV.2501.12948

[23] Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. 2024. How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse. arXiv preprint arXiv:2404.02690 (2024)

[24] Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, Kyriakos Mouratidis, Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, and Bo Tang. 2025. AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. In Companion of the 2025 International Conference on Manag...

[26] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph. Proc. VLDB Endow. 12, 5 (2019), 461–474. https://doi.org/10.14778/3303753.3303754

[27] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 135–153. https://www.usenix.org/conference/osdi24/presentation/fu

[28] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. In Proceedings of the 2024 USENIX Annual Technical Conference. USENIX Association, Santa Clara, CA, USA, 111–126. https://www.usenix.org/...

[29] Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast State Restoration in LLM Serving with HCache. In Proceedings of the Twentieth European Conference on Computer Systems. ACM, Rotterdam, The Netherlands, 128–143. https://doi.org/10.1145/3689031.3696072

[30] Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving. Proceedings of the ACM on Management of Data 3, 3 (2025), 130:1–130:28. https://doi.org/10.1145/3725394

[31] Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. 2024. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs. CoRR abs/2410.13276 (2024). https://doi.org/10.48550/ARXIV.2410.13276

[32] Google. 2025. Gemini. https://gemini.google.com/app. Accessed: 2025-08-01

[33] gradientai. 2024. Llama-3-8B-Instruct-Gradient-1048k. https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k. Accessed: 2024-10-29

[34] Greg Kamradt. 2023. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack. Accessed: 2024-08-12

[35] Rentong Guo, Xiaofan Luan, Long Xiang, Xiao Yan, Xiaomeng Yi, Jigao Luo, Qianya Cheng, Weizhi Xu, Jiarui Luo, Frank Liu, Zhenshan Cao, Yanliang Qiao, Ting Wang, Bo Tang, and Charles Xie. 2022. Manu: A Cloud Native Vector Database Management System. Proc. VLDB Endow. 15, 12 (2022), 3548–3561. https://doi.org/10.14778/3554821.3554843
[36] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada. http://papers.nips.cc/paper_files/paper/202...

[37] Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. 2025. Squeezed Attention: Accelerating Long Context Length LLM Inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)....

[38] Kurt Hornik, Ingo Feinerer, Martin Kober, and Christian Buchta. 2012. Spherical k-means clustering. Journal of statistical software 50 (2012), 1–22.

[39] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models? CoRR abs/2404.06654 (2024). https://doi.org/10.48550/ARXIV.2404.06654

[40] Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing. ACM, Dallas, Texas, USA, 604–613. https://doi.org/10.1145/276698.276876

[41] InfiniGen. 2024. InfiniGen Code. https://github.com/snu-comparch/InfiniGen. Accessed: 2025-04-01.

[42] Johan Ludwig William Valdemar Jensen. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica 30, 1 (1906), 175–193.

[43] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver...

[44] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. ACM, Koblenz, Germany, 611–626. https://doi.org/10.1145/3600006.3613165

[45] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, 155–172. https://www.usenix.org/conference/osdi24/presentation/lee

[46] Viktor Leis. 2024. LeanStore: A High-Performance Storage Engine for NVMe SSDs. Proc. VLDB Endow. 17, 12 (2024), 4536–4545. https://doi.org/10.14778/3685800.3685915

[47] Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Dietrich. 2023. Virtual-Memory Assisted Buffer Management. Proceedings of the ACM on Management of Data 1, 1 (2023), 7:1–7:25. https://doi.org/10.1145/3588687

[48] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In The Thirty-fourth Annual Conference on Neural Information Processing Systems. virtual. ht...

[49] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. In The Thirty-Seventh Annual ...

[50] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada. http://papers.nips.cc/paper_files/paper/2024/hash/28a...

[51] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 663–679. https://www.usenix.org/conference/osdi23/presentation/li-zhouhan

[53] Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 929–945. https://www.usenix.org/conference/osdi24/presentation/lin-chaofan

[54] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems. mlsys.org, Santa Clara, CA, USA. https://proc...

[55] Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu. 2024. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. CoRR abs/2409.10516 (2024). https://doi.org/10.48550/ARXIV.2409.10516

[56] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. 2025. ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression. In 62nd ACM/IEEE Design Automation Conference. IEEE, San Francisco, CA, USA, 1–7. https://doi.org/10.1109/DAC63849.2025.11132479

[57] Shige Liu, Zhifang Zeng, Li Chen, Adil Ainihaer, Arun Ramasami, Songting Chen, Yu Xu, Mingxi Wu, and Jianguo Wang. 2025. TigerVector: Supporting Vector Search in Graph Databases for Advanced RAGs. In Companion of the 2025 International Conference on Management of Data. ACM, Berlin, Germany, 553–565. https://doi.org/10.1145/3722212.3724456

[58] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In Forty-first International Conference on Machine Learning. OpenReview.net, Vienna, Austria. https://openreview.net/forum?id=L057s2Rq8O

[59] Kejing Lu, Mineichi Kudo, Chuan Xiao, and Yoshiharu Ishikawa. 2021. HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approximate Nearest Neighbor Search. Proc. VLDB Endow. 15, 2 (2021), 246–258. https://doi.org/10.14778/3489496.3489506

[60] MagicPIG. 2024. MagicPIG Code. https://github.com/Infini-AI-Lab/MagicPIG. Accessed: 2025-04-01.

[61] Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2020), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473

[62] Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. ACM, Rotterdam, The Netherlands, 58...

[63] Meta. 2024. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2024-09-25.

[64] Meta. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 2025-04-05.

[65] Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas. 2023. High-Throughput Vector Similarity Search in Knowledge Graphs. Proceedings of the ACM on Management of Data 1, 2 (2023), 197:1–197:25. https://doi.org/10.1145/3589777

[66] Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In The Sixth International Conference on Learning Representations. OpenReview.net, Vancouver, BC, Canada. https://openreview.net/forum?id=HkuGJ3kCb

[67] Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. 2023. Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions. In The Eleventh International Conference on Learning Representations. OpenReview.net, Kigali, Rwanda. https://openreview.net/forum?id=4D4TSJE6-K

[68] NVIDIA. 2020. NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/. Accessed: 2025-04-01.

[69] NVIDIA. 2020. NVIDIA RTX A6000 Graphics Card. https://www.nvidia.com/en-us/products/workstations/rtx-a6000/. Accessed: 2025-10-01.

[70] Art of Problem Solving. 2024. AIME Problems and Solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Accessed: 2025-08-01.

[71] Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2024. CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs. In 40th IEEE International Conference on Data Engineering. IEEE, Utrecht, The Netherlands, 4236–4247. https://doi.org/10.1109/ICDE60146.2024.00323

[72] OpenAI. 2025. ChatGPT. https://chat.chatbotapp.ai/. Accessed: 2025-08-01.

[73] James Jie Pan, Jianguo Wang, and Guoliang Li. 2024. Survey of vector database management systems. VLDB J. 33, 5 (2024), 1591–1615. https://doi.org/10.1007/S00778-024-00864-X

[74] Liana Patel, Peter Kraft, Carlos Guestrin, and Matei Zaharia. 2024. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2, 3 (2024), 120. https://doi.org/10.1145/3654923

[75] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of the Sixth Conference on Machine Learning and Systems. mlsys.org, Miami, FL, USA. https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html

[77] PQCache. 2024. PQCache. https://github.com/HugoZHL/PQCache. Accessed: 2025-04-01.

[78] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 155–170. https://www.usenix...

[79] Quest. 2024. Quest Code. https://github.com/mit-han-lab/Quest. Accessed: 2025-04-01.

[80] Qwen. 2024. Qwen2.5-72B-Instruct. https://huggingface.co/Qwen/Qwen2.5-72B-Instruct. Accessed: 2025-01-12.

[81] Qwen. 2024. Qwen2.5-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct. Accessed: 2025-01-12.

Showing first 80 references.

[1] [1]

01-ai. 2024. Yi-6B-200K. https://huggingface.co/01-ai/Yi-6B-200K. Accessed: 2024-11-11

work page 2024

[2] [2]

01-ai. 2024. Yi-9B-200K. https://huggingface.co/01-ai/Yi-9B-200K. Accessed: 2024-11-11

work page 2024

[3] [3]

Gulavani, Alexey Tumanov, and Ramachandran Ramjee

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwa- tra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 117–134. https://www.usenix....

work page 2024

[4] [4]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.CoRRabs/2308.16369 (2023). https://doi.org/10.48550/ARXIV.2308.16369

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.16369 2023

[5] [5]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 4895–4901. https:...

work page 2023

[6] [6]

doi: 10.18653/v1/2024.acl-long

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C. del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for C...

work page doi:10.18653/v1/2024.acl-long 2024

[7] [7]

Anthropic. 2025. Claude. https://www.anthropic.com/claude. Accessed: 2025- 08-01

work page 2025

[8] [8]

C., Arun Iyer, Suresh Parthasarathy, Sriram K

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning.Proceedings of the ACM on Software Engineering1, FSE (2024), 675–

work page 2024

[9] [9]

https://doi.org/10.1145/3643757

work page doi:10.1145/3643757

[10] [10]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. CoRRabs/2406.02069 (2024). https://doi.org/10.48550/ARXIV.2406.02069

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.02069 2024

[11] [11]

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling.CoRRabs/2302.01318 (2023). https: //doi.org/10.48550/ARXIV.2302.01318

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.01318 2023

[12] [12]

Cheng Chen, Chenzhe Jin, Yunan Zhang, Sasha Podolsky, Chun Wu, Szu- Po Wang, Eric Hanson, Zhou Sun, Robert Walzer, and Jianguo Wang. 2024. SingleStore-V: An Integrated Vector Database System in SingleStore.Proc. VLDB Endow.17, 12 (2024), 3772–3785. https://doi.org/10.14778/3685800.3685805

work page doi:10.14778/3685800.3685805 2024

[13] [13]

Sean Wang

Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. 2024. RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search.Proc. VLDB Endow.17, 11 (2024), 2735–2749. https: //doi.org/10.14778/3681954.3681959

work page doi:10.14778/3681954.3681959 2024

[14] [14]

Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025. IMPRESS: An Importance- Informed Multi-Tier Prefix KV Storage System for Large Language Model Infer- ence. In23rd USENIX Conference on File and Storage Technologies. USENIX As- sociation, Santa Clara, CA, USA, 187–201. https://www.use...

work page 2025

[15] [15]

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2024. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs.CoRRabs/2412.21187 (2024). https: //doi.org/10.48550/ARXIV.2412.21187

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.21187 2024

[16] [16]

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. 2025. MagicPIG: LSH Sampling for Efficient LLM Generation. In The Thirteenth International Conference on Learning Representations. OpenRe- view.net, Singapore. https://openreview.net/forum?id=ALzTQUgW8a

work page 2025

[17] [17]

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers.CoRRabs/1904.10509 (2019). http: //arxiv.org/abs/1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019

[18] [18]

and Niculae, Vlad and Martins, André F.T

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2174–2184. https://doi.org/10.18653...

work page doi:10.18653/v1/d19-1223 2019

[19] [19]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InThe Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Orleans, LA, USA. http://papers.nips.cc/paper_files/paper/2022/hash/ 67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html

work page 2022

[20] [20]

DeepSeek. 2025. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Accessed: 2025-08-01

work page 2025

[21] [21]

DeepSeek. 2025. DeepSeek-R1-Distill-Qwen-7B. https://huggingface.co/ deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Accessed: 2025-08-01

work page 2025

[22] [22]

DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.CoRRabs/2501.12948 (2025). https://doi.org/10. 48550/ARXIV.2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Yichuan Deng, Zhao Song, Jing Xiong, and Chiwun Yang. 2024. How Sparse Attention Approximates Exact Attention? Your Attention is Naturally 𝑛𝐶 - Sparse.arXiv preprint arXiv:2404.02690(2024)

work page arXiv 2024

[24] [24]

Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, Kyriakos Mouratidis, Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, and Bo Tang. 2025. AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. InCompanion of the 2025 International Conference on Manag...

work page arXiv 2025

[25] [26]

Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph.Proc. VLDB Endow.12, 5 (2019), 461–474. https://doi.org/10.14778/3303753.3303754

work page doi:10.14778/3303753.3303754 2019

[26] [27]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless In- ference for Large Language Models. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 135–153. https://www.usenix.org/conference/osdi24/presentation/fu

work page 2024

[27] [28]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. InProceedings of the 2024 USENIX Annual Technical Conference. USENIX Asso- ciation, Santa Clara, CA, USA, 111–126. https://www.usenix.org/...

work page 2024

[28] [29]

Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast State Restoration in LLM Serving with HCache. InProceedings of the Twentieth European Conference on Computer Systems. ACM, Rotterdam, The Netherlands, 128–143. https: //doi.org/10.1145/3689031.3696072

work page doi:10.1145/3689031.3696072 2025

[29] [30]

Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adap- tive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serv- ing.Proceedings of the ACM on Management of Data3, 3 (2025), 130:1–130:28. https://doi.org/10.1145/3725394

work page doi:10.1145/3725394 2025

[30] [31]

Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, and Mao Yang. 2024. SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs.CoRRabs/2410.13276 (2024). https://doi.org/10.48550/ ARXIV.2410.13276

work page arXiv 2024

[31] [32]

Google. 2025. Gemini. https://gemini.google.com/app. Accessed: 2025-08-01

work page 2025

[32] [33]

gradientai. 2024. Llama-3-8B-Instruct-Gradient-1048k. https://huggingface.co/ gradientai/Llama-3-8B-Instruct-Gradient-1048k. Accessed: 2024-10-29

work page 2024

[33] [34]

Greg Kamradt. 2023. Needle in a haystack - pressure testing llms. https: //github.com/gkamradt/LLMTest_NeedleInAHaystack. Accessed: 2024-08-12

work page 2023

[34] [35]

Rentong Guo, Xiaofan Luan, Long Xiang, Xiao Yan, Xiaomeng Yi, Jigao Luo, Qianya Cheng, Weizhi Xu, Jiarui Luo, Frank Liu, Zhenshan Cao, Yanliang Qiao, Ting Wang, Bo Tang, and Charles Xie. 2022. Manu: A Cloud Native Vector Database Management System.Proc. VLDB Endow.15, 12 (2022), 3548–3561. https://doi.org/10.14778/3554821.3554843

work page doi:10.14778/3554821.3554843 2022

[35] [36]

Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada. http://papers.nips.cc/paper_files/paper/202...

work page 2024

[36] [37]

Mahoney, Kurt Keutzer, and Amir Gholami

Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Mon- ishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. 2025. Squeezed Attention: Accelerating Long Context Length LLM Inference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)....

work page 2025

[37] [38]

Kurt Hornik, Ingo Feinerer, Martin Kober, and Christian Buchta. 2012. Spherical k-means clustering.Journal of statistical software50 (2012), 1–22

work page 2012

[38] [39]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models?CoRRabs/2404.06654 (2024). https://doi.org/10.48550/ARXIV.2404.06654

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.06654 2024

[39] [40]

Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: To- wards Removing the Curse of Dimensionality. InProceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing. ACM, Dallas, Texas, USA, 604–613. https://doi.org/10.1145/276698.276876

work page doi:10.1145/276698.276876 1998

[40] [41]

InfiniGen. 2024. InfiniGen Code. https://github.com/snu-comparch/InfiniGen. Accessed: 2025-04-01

work page 2024

[41] [42]

Johan Ludwig William Valdemar Jensen. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes.Acta mathematica30, 1 (1906), 175–193

work page 1906

[42] [43]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre- filling for Long-Context LLMs via Dynamic Sparse Attention. InThe Thirty- Eighth Annual Conference on Neural Information Processing Systems. Van- couver...

work page 2024

[43] [44]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles. ACM, Koblenz, Germany, 611–626. https://doi.org/10.1145/3600006.3613165

work page doi:10.1145/3600006.3613165 2023

[44] [45]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, 155–172. https: //www.usenix.org/conference/osdi24/presentation/lee

work page 2024

[45] [46]

Viktor Leis. 2024. LeanStore: A High-Performance Storage Engine for NVMe SSDs.Proc. VLDB Endow.17, 12 (2024), 4536–4545. https://doi.org/10.14778/ 3685800.3685915

work page arXiv 2024

[46] [47]

Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Dietrich. 2023. Virtual-Memory Assisted Buffer Management.Proceedings of the ACM on Management of Data1, 1 (2023), 7:1–7:25. https://doi.org/10.1145/ 3588687

work page 2023

[47] [48]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Genera- tion for Knowledge-Intensive NLP Tasks. InThe Thirty-fourth Annual Conference on Neural Information Processing Systems. virtual. ht...

work page 2020

[48] [49]

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. InThe Thirty-Seventh Annual ...

work page 2023

[49] [50]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. InThe Thirty- Eighth Annual Conference on Neural Information Processing Systems. Van- couver, BC, Canada. http://papers.nips.cc/paper_files/paper/2024/hash/ 28a...

work page 2024

[50] [51]

Gonzalez, and Ion Stoica

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica

work page

[51] [52]

In17th USENIX Symposium on Operating Systems Design and Implementation

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In17th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 663–679. https://www.usenix.org/ conference/osdi23/presentation/li-zhouhan

work page

[52] [53]

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. In18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 929–945. https: //www.usenix.org/conference/osdi24/presentation/lin-chaofan

work page 2024

[53] [54]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei- Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. InProceedings of the Seventh An- nual Conference on Machine Learning and Systems. mlsys.org, Santa Clara, CA, USA. https://proc...

work page 2024

[54] [55]

Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu. 2024. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.CoRRabs/2409.10516 (2024). https://doi. org/10.48550/ARXIV.2409.10516

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.10516 2024

[55] [56]

Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. 2025. ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression. In62nd ACM/IEEE Design Automation Conference. IEEE, San Francisco, CA, USA, 1–7. https://doi.org/10.1109/DAC63849.2025.11132479

work page doi:10.1109/dac63849.2025.11132479 2025

[56] [57]

Shige Liu, Zhifang Zeng, Li Chen, Adil Ainihaer, Arun Ramasami, Songting Chen, Yu Xu, Mingxi Wu, and Jianguo Wang. 2025. TigerVector: Supporting Vector Search in Graph Databases for Advanced RAGs. InCompanion of the 2025 International Conference on Management of Data. ACM, Berlin, Germany, 553–565. https://doi.org/10.1145/3722212.3724456

work page doi:10.1145/3722212.3724456 2025

[57] [58]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asym- metric 2bit Quantization for KV Cache. InForty-first International Conference on Machine Learning. OpenReview.net, Vienna, Austria. https://openreview. net/forum?id=L057s2Rq8O

work page 2024

[58] [59]

Kejing Lu, Mineichi Kudo, Chuan Xiao, and Yoshiharu Ishikawa. 2021. HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approx- imate Nearest Neighbor Search.Proc. VLDB Endow.15, 2 (2021), 246–258. https://doi.org/10.14778/3489496.3489506

work page doi:10.14778/3489496.3489506 2021

[59] [60]

MagicPIG. 2024. MagicPIG Code. https://github.com/Infini-AI-Lab/MagicPIG. Accessed: 2025-04-01

work page 2024

[60] [61]

Malkov and Dmitry A

Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approx- imate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE Transactions on Pattern Analysis and Machine Intelligence42, 4 (2020), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473

work page doi:10.1109/tpami.2018.2889473 2020

[61] [62]

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Hetero- geneous GPUs and Network via Max-Flow. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. ACM, Rotterdam, The Netherlands, 58...

work page doi:10.1145/3669940.3707215 2025

[62] [63]

Meta. 2024. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama- 3.1-8B-Instruct. Accessed: 2024-09-25

work page 2024

[63] [64]

Meta. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Ac- cessed: 2025-04-05

work page 2025

[64] [65]

Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas

Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas. 2023. High-Throughput Vector Similarity Search in Knowledge Graphs.Proceedings of the ACM on Management of Data1, 2 (2023), 197:1–197:25. https://doi.org/ 10.1145/3589777

work page doi:10.1145/3589777 2023

[65] [66]

Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. InThe Sixth International Conference on Learning Representations. OpenReview.net, Vancouver, BC, Canada. https: //openreview.net/forum?id=HkuGJ3kCb

work page 2018

[66] [67]

Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. 2023. Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions. InThe Eleventh International Conference on Learning Representations. OpenReview.net, Kigali, Rwanda. https://openreview.net/forum?id=4D4TSJE6-K

work page 2023

[67] [68]

NVIDIA. 2020. NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en- us/data-center/a100/. Accessed: 2025-04-01

work page 2020

[68] [69]

NVIDIA. 2020. NVIDIA RTX A6000 Graphics Card. https://www.nvidia.com/en- us/products/workstations/rtx-a6000/. Accessed: 2025-10-01

work page 2020

[69] [70]

Art of Problem Solving. 2024. AIME Problems and Solutions. https:// artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Ac- cessed: 2025-08-01

[70] [71]

Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2024. CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs. In 40th IEEE International Conference on Data Engineering. IEEE, Utrecht, The Netherlands, 4236–4247. https://doi.org/10.1109/ICDE60146.2024.00323

[71] [72]

OpenAI. 2025. ChatGPT. https://chat.chatbotapp.ai/. Accessed: 2025-08-01

[72] [73]

James Jie Pan, Jianguo Wang, and Guoliang Li. 2024. Survey of vector database management systems. VLDB J. 33, 5 (2024), 1591–1615. https://doi.org/10.1007/S00778-024-00864-X

[73] [74]

Liana Patel, Peter Kraft, Carlos Guestrin, and Matei Zaharia. 2024. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2, 3 (2024), 120. https://doi.org/10.1145/3654923

[74] [75]

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of the Sixth Conference on Machine Learning and Systems. mlsys.org, Miami, FL, USA. https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html

[76] [77]

PQCache. 2024. PQCache. https://github.com/HugoZHL/PQCache. Accessed: 2025-04-01

[77] [78]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 155–170. https://www.usenix...

[78] [79]

Quest. 2024. Quest Code. https://github.com/mit-han-lab/Quest. Accessed: 2025-04-01

[79] [80]

Qwen. 2024. Qwen2.5-72B-Instruct. https://huggingface.co/Qwen/Qwen2.5-72B-Instruct. Accessed: 2025-01-12

[80] [81]

Qwen. 2024. Qwen2.5-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct. Accessed: 2025-01-12
