arxiv: 2409.10516 · v3 · submitted 2024-09-16 · 💻 cs.LG · cs.CL

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, Lili Qiu

Pith reviewed 2026-05-18 08:07 UTC · model grok-4.3

keywords: long-context LLMs, attention acceleration, vector retrieval, ANNS indexes, KV cache, inference optimization, training-free method

The pith

RetrievalAttention approximates full attention accuracy by retrieving only 1-3% of KV vectors through an attention-aware search on CPU-stored indexes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-context LLMs face slow inference and high memory costs from quadratic attention over many tokens. RetrievalAttention stores key-value vectors in CPU memory as ANNS indexes and retrieves the most relevant entries during each generation step instead of computing over the full set. The method introduces an attention-aware vector search that adapts to the distribution mismatch between query and key vectors. This keeps output quality nearly identical to standard attention while accessing only 1 to 3 percent of the data. The approach therefore allows much lower GPU memory use and faster token generation on modest hardware.

Core claim

RetrievalAttention is a training-free method that builds approximate nearest neighbor indexes over KV vectors in CPU memory and applies a custom attention-aware vector search to retrieve relevant entries, achieving near full attention accuracy with access to 1-3% of the data.
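A minimal sketch of the idea in the core claim, for a single head and a single decoding step. It is not the paper's implementation: `top_k_indices` is a brute-force exact search standing in for the ANNS index lookup, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, indices=None):
    """Attention output over all keys, or only over the keys listed in `indices`."""
    indices = list(range(len(keys))) if indices is None else list(indices)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    out = [0.0] * len(values[0])
    for w, i in zip(weights, indices):
        for d, v in enumerate(values[i]):
            out[d] += w * v
    return out

def top_k_indices(query, keys, k):
    """Exact top-k by inner product (assumption: stands in for an ANNS lookup)."""
    scores = [dot(query, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

# Toy data: one key dominates the scores, so one retrieved entry is enough.
keys = [[10.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-10.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [2.0, 0.0]

full = attend(query, keys, values)
sparse = attend(query, keys, values, top_k_indices(query, keys, 1))
print(abs(full[0] - sparse[0]))  # small when attention is concentrated
```

The sparse result tracks full attention only because the toy scores are sharply peaked; with flat scores the same retrieval budget would lose much more of the output.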

What carries the argument

An attention-aware vector search algorithm that adapts to the distribution of query vectors. It closes the out-of-distribution gap between queries and keys, so that retrieval from ANNS indexes built on KV vectors stays accurate.

If this is right

Attention computation time drops because only retrieved vectors participate in each step.
GPU memory footprint shrinks since the bulk of the KV cache stays in CPU rather than device memory.
Models with 8B parameters can handle 128K-token contexts on a single 24GB consumer GPU.
Token generation reaches 0.188 seconds per token under the reported hardware setup.
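The memory consequence can be checked with back-of-the-envelope arithmetic. The sketch below assumes a Llama-3-8B-style layout (32 layers, 8 KV heads, head dimension 128, fp16 values); the source states only "8B parameters", so these architecture numbers and the helper names are assumptions.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

def retrieved_tokens(num_tokens, fraction):
    """Number of KV entries touched per step at a given access fraction."""
    return round(num_tokens * fraction)

context = 128 * 1024                       # 128K tokens
full_cache = kv_cache_bytes(32, 8, 128, context)  # assumed model layout
weights = 8e9 * 2                          # assumed: 8B parameters stored in fp16

print(f"KV cache: {full_cache / GIB:.1f} GiB")
print(f"fp16 weights: {weights / GIB:.1f} GiB")
print(f"entries touched at 1% and 3%: "
      f"{retrieved_tokens(context, 0.01)} and {retrieved_tokens(context, 0.03)}")
```

Under these assumptions the full cache alone is 16 GiB, so weights plus cache would not fit in 24 GB of device memory, which is consistent with keeping the bulk of the cache in CPU memory.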

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval pattern could be tested on other attention variants such as grouped-query or multi-query attention.
Energy use per token might decrease in large-scale serving clusters because fewer vectors are loaded and processed each step.
The approach may need task-specific tuning if attention sparsity patterns change sharply across domains.

Load-bearing premise

The attention-aware search can consistently identify the small fraction of KV vectors that matter most despite the distribution gap between queries and keys.
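One way to probe this premise is to score a retrieved set against the exact answer. The sketch below is a hypothetical helper, not a metric taken from the paper: it reports recall against the exact top-k and the share of softmax attention mass the retrieved keys carry.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieval_quality(scores, retrieved, k):
    """Return (recall against exact top-k, attention mass covered by `retrieved`).

    `scores` are the scaled query-key logits for one query;
    `retrieved` are the key indices returned by the approximate search.
    """
    weights = softmax(scores)
    exact = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    got = set(retrieved)
    recall = len(exact & got) / k
    mass = sum(weights[i] for i in got)
    return recall, mass

scores = [5.0, 1.0, 4.0, 0.0]
print(retrieval_quality(scores, [0, 2], k=2))  # retrieved set equals the exact top-2
print(retrieval_quality(scores, [0, 1], k=2))  # one critical key missed
```

Recall and mass can disagree: missing a key with a small weight costs recall but little mass, which is why both are worth reporting.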

What would settle it

A side-by-side run of full attention and RetrievalAttention on the same long input sequence that shows generated token probabilities or final outputs diverging beyond a small error bound.
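Such a comparison could be scored as follows. This is a sketch under stated assumptions: the two inputs are next-token probability distributions from a full-attention run and a RetrievalAttention run on the same prefix, and the threshold `max_kl` is an illustrative value, not one reported in the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def within_bound(p_full, p_sparse, max_kl=1e-3):
    """True if the sparse run stays within the chosen divergence bound (assumed value)."""
    return kl_divergence(p_full, p_sparse) <= max_kl

p_full = [0.70, 0.20, 0.10]
p_close = [0.69, 0.21, 0.10]
p_far = [0.10, 0.20, 0.70]

print(within_bound(p_full, p_close))
print(within_bound(p_full, p_far))
```

Averaging the divergence over many positions and inputs, rather than checking a single step, would give a more stable picture.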

Original abstract

Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of attention mechanism, RetrievalAttention proposes to build approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieve the most relevant ones through vector search during generation. Unfortunately, we observe that the off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) between query vectors and key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation demonstrates that RetrievalAttention achieves near full attention accuracy while only requiring access to 1-3% of the data. This leads to a significant reduction in the inference cost of long-context LLMs, with a much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX4090 (24GB) to serve 128K tokens for LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RetrievalAttention, a training-free method that accelerates long-context LLM inference by storing KV vectors in CPU memory, building ANNS indexes, and using a custom attention-aware vector search to retrieve only 1-3% of entries during generation. It claims this preserves near-full attention accuracy, substantially reduces GPU memory (e.g., serving 128K tokens for 8B models on a single RTX 4090), and achieves 0.188s per token generation.

Significance. If the accuracy claims hold under broader testing, the work offers a practical, training-free route to lower memory and compute costs for long-context inference using commodity hardware and existing vector-search libraries. The emphasis on dynamic sparsity and OOD adaptation provides a concrete engineering contribution that could complement existing sparse-attention techniques.

major comments (2)

[Evaluation] Evaluation section: the central claim that RetrievalAttention achieves 'near full attention accuracy' at 1-3% access is only moderately supported; the abstract and reported results provide no quantitative error metrics (e.g., perplexity delta, token-level accuracy, or KL divergence to full attention), no baseline comparisons against other sparse or approximate attention methods, and no details on how accuracy was measured across models, heads, or context lengths. This directly affects confidence in the 1-3% sufficiency claim.
[Section 3] Section 3 (attention-aware vector search): the adaptation rule introduced to handle the OOD gap between query and key vectors is described at a high level but lacks sufficient specification or ablation to verify robustness; if the rule is a fixed heuristic rather than a provably sufficient approximation, small changes in head dimension, model scale, or context length could cause critical tokens to be missed, inflating the accuracy gap beyond the reported near-full level.

minor comments (2)

[Figures] Figure captions and axis labels should explicitly state the models, context lengths, and exact accuracy metric used so readers can interpret the 1-3% access results without ambiguity.
[Related Work] The manuscript would benefit from a short related-work paragraph contrasting the proposed attention-aware search with prior ANNS-for-attention or sparse-attention papers to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to provide stronger quantitative support and additional technical details.

Referee: [Evaluation] Evaluation section: the central claim that RetrievalAttention achieves 'near full attention accuracy' at 1-3% access is only moderately supported; the abstract and reported results provide no quantitative error metrics (e.g., perplexity delta, token-level accuracy, or KL divergence to full attention), no baseline comparisons against other sparse or approximate attention methods, and no details on how accuracy was measured across models, heads, or context lengths. This directly affects confidence in the 1-3% sufficiency claim.

Authors: We agree that more rigorous quantitative metrics and comparisons would increase confidence in the results. In the revised manuscript we expand the evaluation section to report perplexity deltas versus full attention, token-level accuracy on downstream tasks, and KL divergence to the full-attention distribution. We also add direct comparisons against representative sparse-attention baselines (e.g., StreamingLLM and H2O) and explicitly describe the measurement protocol, including how accuracy is aggregated across models, heads, and context lengths up to 128K tokens. These additions directly address the concern about the 1-3% sufficiency claim.
revision: yes

Referee: [Section 3] Section 3 (attention-aware vector search): the adaptation rule introduced to handle the OOD gap between query and key vectors is described at a high level but lacks sufficient specification or ablation to verify robustness; if the rule is a fixed heuristic rather than a provably sufficient approximation, small changes in head dimension, model scale, or context length could cause critical tokens to be missed, inflating the accuracy gap beyond the reported near-full level.

Authors: We acknowledge that the original description of the adaptation rule was high-level. The revised Section 3 now supplies the complete algorithmic specification, including the exact adaptation formula, pseudocode, and the calibration procedure used to estimate the query-key distribution shift. We further include ablation studies that vary head dimension, model scale (7B-13B), and context length, demonstrating that the rule continues to retrieve critical tokens and maintains near-full accuracy across these regimes. While the rule remains a practical heuristic rather than a theoretical guarantee, the added experiments provide empirical evidence of its robustness within the tested operating range.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a training-free algorithmic proposal with independent empirical validation

The paper introduces RetrievalAttention as a new training-free algorithm that builds ANNS indexes on KV vectors and uses a custom attention-aware search to mitigate OOD between queries and keys. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central claim (near-full accuracy with 1-3% retrieval) is presented as an empirical outcome of the proposed search adaptation rather than a renaming or forced prediction of inputs. The approach relies on external vector-search libraries and reported benchmarks, so the central claim can be checked by external evaluation rather than resting on the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that attention exhibits sufficient dynamic sparsity for small-fraction retrieval to suffice, plus the effectiveness of the custom search algorithm; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: The attention mechanism in Transformers exhibits dynamic sparsity such that only a small fraction of KV vectors are relevant for each query.
Invoked when the abstract states that RetrievalAttention leverages the dynamic sparsity of the attention mechanism.
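The sparsity assumption is directly measurable for any given query. A minimal sketch, with a hypothetical helper name and an illustrative 95% target that does not come from the paper: sort the softmax weights and count how many keys are needed to reach the target mass.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fraction_needed(scores, target_mass=0.95):
    """Smallest fraction of keys whose softmax weights sum to at least `target_mass`."""
    weights = sorted(softmax(scores), reverse=True)
    covered = 0.0
    for count, w in enumerate(weights, start=1):
        covered += w
        if covered >= target_mass:
            return count / len(weights)
    return 1.0

peaked = [10.0] + [0.0] * 99   # one key dominates
flat = [0.0] * 4               # uniform attention, no sparsity

print(fraction_needed(peaked))     # a single key out of 100 suffices
print(fraction_needed(flat, 0.5))  # half of the keys are needed
```

If this fraction routinely exceeded the 1-3% budget on some heads or tasks, the axiom would not hold there and retrieval at that budget would be expected to lose accuracy.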

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
cs.CL 2025-10 conditional novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
cs.DC 2026-05 unverdicted novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
cs.AR 2026-04 unverdicted novelty 6.0

AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
Graph-Guided Adaptive Channel Elimination for KV Cache Compression
eess.SP 2026-04 unverdicted novelty 6.0

GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
cs.DB 2026-04 unverdicted novelty 6.0

SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...
KV Cache Offloading for Context-Intensive Tasks
cs.LG 2026-04 unverdicted novelty 6.0

KV offloading degrades performance on context-intensive tasks due to low-rank key projections and unreliable landmarks, but a simpler alternative strategy restores accuracy across LLM families.
KV Cache Offloading for Context-Intensive Tasks
cs.LG 2026-04 unverdicted novelty 6.0

KV offloading hurts accuracy on context-heavy tasks due to low-rank key projections and bad landmarks, but a simpler strategy recovers performance across models.
KV Cache Offloading for Context-Intensive Tasks
cs.LG 2026-04 unverdicted novelty 6.0

KV offloading hurts accuracy on context-heavy tasks because of low-rank key projections and bad landmarks, but a simpler strategy improves results across models and benchmarks.
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
cs.LG 2026-03 unverdicted novelty 6.0

CSAttention precomputes fixed-size query-centric lookup tables in offline prefill to enable fast table-lookup decoding, delivering near-identical accuracy to full attention and up to 4.6x speedup at 95% sparsity for 3...
MoBA: Mixture of Block Attention for Long-Context LLMs
cs.LG 2025-02 unverdicted novelty 6.0

MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
cs.CL 2024-07 accept novelty 6.0

Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
cs.CL 2026-04 unverdicted novelty 5.0

DepthKV allocates a fixed global KV cache budget across LLM layers based on per-layer pruning sensitivity, outperforming uniform pruning at the same overall budget.
Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction
cs.CL 2026-04 unverdicted novelty 5.0

K2K framework enables internal memory retrieval in LLMs for healthcare outcome prediction, achieving state-of-the-art results on four benchmarks.
MemOS: A Memory OS for AI System
cs.CL 2025-07 unverdicted novelty 5.0

MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
ScaleGANN: Accelerate Large-Scale ANN Indexing by Cost-effective Cloud GPUs
cs.DB 2026-05 unverdicted novelty 4.0

ScaleGANN accelerates graph-based ANN index construction up to 9x faster and 6x cheaper than DiskANN by using divide-and-merge on distributed low-cost spot GPUs with optimized partitioning and a cost-aware scheduler.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
cs.LG 2025-05


work page doi:10.18653/v1/2020.emnlp-main.19 2020
[71]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[72]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. ArXiv preprint, abs/2004.05150, 2020. URL https://arxiv.org/abs/2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2004
[73]

Unlimiformer: Long-range transformers with unlimited length input

Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew Gormley. Unlimiformer: Long-range transformers with unlimited length input. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[74]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic KV cache compression based on pyramidal information funneling. CoRR, abs/2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[75]

Sean Wang

Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. Roargraph: A projected bipartite graph for efficient cross-modal approximate nearest neighbor search. Proc. VLDB Endow., 17 0 (11): 0 2735–2749, 2024 a . ISSN 2150-8097. doi:10.14778/3681954.3681959. URL https://doi.org/10.14778/3681954.3681959

work page doi:10.14778/3681954.3681959 2024
[76]

Efficient and economic large language model inference with attention offloading

Shaoyuan Chen, Yutong Lin, Mingxing Zhang, and Yongwei Wu. Efficient and economic large language model inference with attention offloading. ArXiv preprint, abs/2405.01814, 2024 b . URL https://arxiv.org/abs/2405.01814

work page arXiv 2024
[77]

Magicpig: sparse inference engine for llm

Zhuoming Chen. Magicpig: sparse inference engine for llm. https://github.com/Infini-AI-Lab/MagicPiG/, 2024. Accessed: 2024-08-01

work page 2024
[78]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. ArXiv preprint, abs/1904.10509, 2019. URL https://arxiv.org/abs/1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 1904
[79]

A weighted nearest neighbor algorithm for learning with symbolic features

Scott Cost and Steven Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine learning, 10: 0 57--78, 1993

work page 1993
[80]

Deep neural networks for youtube recommendations

Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Shilad Sen, Werner Geyer, Jill Freyne, and Pablo Castells (eds.), Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016 , pp.\ 191--198. ACM , 2016. doi:10.1145/2959100.2959190. URL https://doi.org/10.1145/295910...

work page doi:10.1145/2959100.2959190 2016
[81]

Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . Flash A ttention: Fast and memory-efficient exact attention with IO -awareness. In Advances in Neural Information Processing Systems, 2022

work page 2022
[82]

Attention is naturally sparse with gaussian distributed input, 2024

Yichuan Deng, Zhao Song, and Chiwun Yang. Attention is naturally sparse with gaussian distributed input, 2024

work page 2024
[83]

The faiss library

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024

work page 2024

Showing first 80 references.