RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Pith reviewed 2026-05-18 08:07 UTC · model grok-4.3
The pith
RetrievalAttention approximates full attention accuracy by retrieving only 1-3% of KV vectors through an attention-aware search on CPU-stored indexes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RetrievalAttention is a training-free method that builds approximate nearest neighbor indexes over KV vectors in CPU memory and applies a custom attention-aware vector search to retrieve relevant entries, achieving near full attention accuracy with access to 1-3% of the data.
What carries the argument
An attention-aware vector search algorithm that adapts to the distribution of query vectors. It closes the out-of-distribution gap between queries and keys, so that ANNS indexes built over KV vectors return the entries that actually matter for attention.
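A minimal sketch of the retrieve-then-attend idea, assuming a single query and a single head. Brute-force exact top-k by inner product stands in for the ANNS index and the attention-aware search, neither of which this page specifies; the names `retrieval_attention` and `frac` and the synthetic query are illustrative, not the paper's.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def full_attention(q, K, V):
    # Single-query scaled dot-product attention over all keys.
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def retrieval_attention(q, K, V, frac=0.02):
    # Attend only to the top `frac` of keys by inner product with q.
    # Exact top-k is an assumption standing in for an ANNS index lookup.
    n = K.shape[0]
    k = max(1, int(np.ceil(frac * n)))
    scores = K @ q / np.sqrt(q.shape[0])
    idx = np.argpartition(scores, n - k)[n - k:]
    return softmax(scores[idx]) @ V[idx], idx

rng = np.random.default_rng(0)
n, d = 1000, 64
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
# Hypothetical query aligned with a handful of keys, so attention is sparse.
q = 4.0 * K[[3, 250, 777]].sum(axis=0)

exact = full_attention(q, K, V)
approx, idx = retrieval_attention(q, K, V, frac=0.02)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"attended keys: {len(idx)} of {n}, relative error: {rel_err:.2e}")
```

The sketch only shows why a small retrieved set can suffice when attention is sparse; it says nothing about how well an approximate index finds that set.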
If this is right
- Attention computation time drops because only retrieved vectors participate in each step.
- GPU memory footprint shrinks since the bulk of the KV cache stays in CPU rather than device memory.
- Models with 8B parameters can handle 128K-token contexts on a single 24GB consumer GPU.
- Token generation reaches 0.188 seconds per token under the reported hardware setup.
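The memory claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes a Llama-3-8B-style configuration (32 layers, 8 KV heads, head dimension 128, fp16); these values are assumptions for illustration and are not stated on this page.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Assumed model configuration (hypothetical for this example).
tokens = 128 * 1024
cache = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
weights = 8e9 * 2  # roughly 8B parameters stored in fp16

gib = 1024 ** 3
print(f"KV cache: {cache / gib:.1f} GiB")
print(f"Weights:  {weights / gib:.1f} GiB")
print(f"Total:    {(cache + weights) / gib:.1f} GiB versus 24 GiB of GPU memory")
```

Under these assumptions the KV cache alone is 16 GiB and the fp16 weights about 14.9 GiB, which together exceed 24 GiB. That gap is why keeping most of the cache in CPU memory matters for the single-GPU claim.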
Where Pith is reading between the lines
- The same retrieval pattern could be tested on other attention variants such as grouped-query or multi-query attention.
- Energy use per token might decrease in large-scale serving clusters because fewer vectors are loaded and processed each step.
- The approach may need task-specific tuning if attention sparsity patterns change sharply across domains.
Load-bearing premise
The attention-aware search can consistently identify the small fraction of KV vectors that matter most despite the distribution gap between queries and keys.
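One way to probe this premise is to measure recall of a retriever against the exact top-k keys by inner product. The sketch below is a generic recall@k check; `recall_at_k` and the toy inputs are hypothetical, and the retrieved IDs would come from whatever index is under test.

```python
import numpy as np

def exact_topk(q, K, k):
    # Ground truth: indices of the k keys with the largest inner product with q.
    return np.argsort(-(K @ q))[:k]

def recall_at_k(retrieved_ids, q, K, k):
    # Fraction of the true top-k keys that the retriever returned.
    truth = set(exact_topk(q, K, k).tolist())
    found = set(int(i) for i in retrieved_ids)
    return len(truth & found) / k

# Toy example: four orthogonal keys, so the true ranking follows q's entries.
K = np.eye(4)
q = np.array([4.0, 3.0, 2.0, 1.0])
print(recall_at_k([0, 3], q, K, k=2))  # one of the two true neighbors found
print(recall_at_k([1, 0], q, K, k=2))  # both found
```

Low recall on real query vectors, despite high recall on key-like probes, would be the signature of the distribution gap described above.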
What would settle it
A side-by-side run of full attention and RetrievalAttention on the same long input sequence that shows generated token probabilities or final outputs diverging beyond a small error bound.
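A sketch of how such a comparison could be scored, assuming per-step next-token logits are available from both runs. The function names and the tolerance `tol` are hypothetical; the page does not define the error bound.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-probabilities from raw logits.
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def kl_divergence(logits_ref, logits_test):
    # KL(P_ref || P_test) in nats, with P_ref from full attention.
    lp = log_softmax(np.asarray(logits_ref, dtype=np.float64))
    lq = log_softmax(np.asarray(logits_test, dtype=np.float64))
    return float(np.sum(np.exp(lp) * (lp - lq)))

def diverges(ref_steps, test_steps, tol=1e-2):
    # Flags divergence if any step exceeds `tol` in KL or flips the top-1 token.
    kls = [kl_divergence(r, t) for r, t in zip(ref_steps, test_steps)]
    same_top1 = all(int(np.argmax(r)) == int(np.argmax(t))
                    for r, t in zip(ref_steps, test_steps))
    return (max(kls) > tol) or (not same_top1), kls

ref = [[2.0, 1.0, 0.1], [0.5, 3.0, 0.2]]
close = [[2.0, 1.0, 0.1], [0.5, 3.0, 0.2]]
far = [[2.0, 1.0, 0.1], [3.0, 0.5, 0.2]]
print(diverges(ref, close)[0], diverges(ref, far)[0])
```

Greedy decoding can mask small probability shifts, so reporting the per-step KL values alongside the top-1 agreement gives a fuller picture than comparing final text alone.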
Original abstract
Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of attention mechanism, RetrievalAttention proposes to build approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieve the most relevant ones through vector search during generation. Unfortunately, we observe that the off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) between query vectors and key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation demonstrates that RetrievalAttention achieves near full attention accuracy while only requiring access to 1-3% of the data. This leads to a significant reduction in the inference cost of long-context LLMs, with a much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX4090 (24GB) to serve 128K tokens for LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RetrievalAttention, a training-free method that accelerates long-context LLM inference by storing KV vectors in CPU memory, building ANNS indexes, and using a custom attention-aware vector search to retrieve only 1-3% of entries during generation. It claims this preserves near-full attention accuracy, substantially reduces GPU memory (e.g., serving 128K tokens for 8B models on a single RTX 4090), and achieves 0.188s per token generation.
Significance. If the accuracy claims hold under broader testing, the work offers a practical, training-free route to lower memory and compute costs for long-context inference using commodity hardware and existing vector-search libraries. The emphasis on dynamic sparsity and OOD adaptation provides a concrete engineering contribution that could complement existing sparse-attention techniques.
major comments (2)
- [Evaluation] Evaluation section: the central claim that RetrievalAttention achieves 'near full attention accuracy' at 1-3% access is only moderately supported; the abstract and reported results provide no quantitative error metrics (e.g., perplexity delta, token-level accuracy, or KL divergence to full attention), no baseline comparisons against other sparse or approximate attention methods, and no details on how accuracy was measured across models, heads, or context lengths. This directly affects confidence in the 1-3% sufficiency claim.
- [Section 3] Section 3 (attention-aware vector search): the adaptation rule introduced to handle the OOD gap between query and key vectors is described at a high level but lacks sufficient specification or ablation to verify robustness; if the rule is a fixed heuristic rather than a provably sufficient approximation, small changes in head dimension, model scale, or context length could cause critical tokens to be missed, inflating the accuracy gap beyond the reported near-full level.
minor comments (2)
- [Figures] Figure captions and axis labels should explicitly state the models, context lengths, and exact accuracy metric used so readers can interpret the 1-3% access results without ambiguity.
- [Related Work] The manuscript would benefit from a short related-work paragraph contrasting the proposed attention-aware search with prior ANNS-for-attention or sparse-attention papers to clarify novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and have revised the manuscript to provide stronger quantitative support and additional technical details.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: the central claim that RetrievalAttention achieves 'near full attention accuracy' at 1-3% access is only moderately supported; the abstract and reported results provide no quantitative error metrics (e.g., perplexity delta, token-level accuracy, or KL divergence to full attention), no baseline comparisons against other sparse or approximate attention methods, and no details on how accuracy was measured across models, heads, or context lengths. This directly affects confidence in the 1-3% sufficiency claim.
Authors: We agree that more rigorous quantitative metrics and comparisons would increase confidence in the results. In the revised manuscript we expand the evaluation section to report perplexity deltas versus full attention, token-level accuracy on downstream tasks, and KL divergence to the full-attention distribution. We also add direct comparisons against representative sparse-attention baselines (e.g., StreamingLLM and H2O) and explicitly describe the measurement protocol, including how accuracy is aggregated across models, heads, and context lengths up to 128K tokens. These additions directly address the concern about the 1-3% sufficiency claim.
Revision: yes
-
Referee: [Section 3] Section 3 (attention-aware vector search): the adaptation rule introduced to handle the OOD gap between query and key vectors is described at a high level but lacks sufficient specification or ablation to verify robustness; if the rule is a fixed heuristic rather than a provably sufficient approximation, small changes in head dimension, model scale, or context length could cause critical tokens to be missed, inflating the accuracy gap beyond the reported near-full level.
Authors: We acknowledge that the original description of the adaptation rule was high-level. The revised Section 3 now supplies the complete algorithmic specification, including the exact adaptation formula, pseudocode, and the calibration procedure used to estimate the query-key distribution shift. We further include ablation studies that vary head dimension, model scale (7B-13B), and context length, demonstrating that the rule continues to retrieve critical tokens and maintains near-full accuracy across these regimes. While the rule remains a practical heuristic rather than a theoretical guarantee, the added experiments provide empirical evidence of its robustness within the tested operating range.
Revision: yes
Circularity Check
No significant circularity; method is a training-free algorithmic proposal with independent empirical validation
Full rationale
The paper introduces RetrievalAttention as a new training-free algorithm that builds ANNS indexes on KV vectors and uses a custom attention-aware search to mitigate OOD between queries and keys. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central claim (near-full accuracy with 1-3% retrieval) is presented as an empirical outcome of the proposed search adaptation rather than a renaming or forced prediction of inputs. The approach relies on external vector-search libraries and reported benchmarks, making the derivation self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The attention mechanism in Transformers exhibits dynamic sparsity such that only a small fraction of KV vectors are relevant for each query.
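The sparsity assumption can be tested directly by asking how much softmax attention mass the top fraction of keys captures for a given query. The sketch below uses hypothetical names and toy inputs; on a real model, `q` and `K` would be taken from a specific layer and head.

```python
import numpy as np

def topk_attention_mass(q, K, frac):
    # Share of softmax attention weight held by the top `frac` of keys.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    k = max(1, int(np.ceil(frac * K.shape[0])))
    return float(np.sort(w)[-k:].sum())

# Toy sparse case: one key aligned with the query, the rest at zero.
K_sparse = np.zeros((100, 4))
K_sparse[7] = [8.0, 0.0, 0.0, 0.0]
q = np.array([8.0, 0.0, 0.0, 0.0])
print(topk_attention_mass(q, K_sparse, frac=0.01))

# Toy dense case: all keys identical, so attention is uniform.
K_uniform = np.zeros((100, 4))
print(topk_attention_mass(q, K_uniform, frac=0.5))
```

A value close to 1 at a 1-3% fraction supports the assumption for that head and query; values well below 1 indicate heads where a small retrieved set would drop meaningful attention weight.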
Forward citations
Cited by 20 Pith papers
-
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...
-
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
-
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
-
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
AQPIM performs in-memory product quantization of activations for LLMs on PIM hardware, reducing GPU-CPU communication by 90-98.5% and delivering 3.4x speedup over prior PIM methods.
-
Graph-Guided Adaptive Channel Elimination for KV Cache Compression
GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
-
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing
SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading degrades performance on context-intensive tasks due to low-rank key projections and unreliable landmarks, but a simpler alternative strategy restores accuracy across LLM families.
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading hurts accuracy on context-heavy tasks due to low-rank key projections and bad landmarks, but a simpler strategy recovers performance across models.
-
KV Cache Offloading for Context-Intensive Tasks
KV offloading hurts accuracy on context-heavy tasks because of low-rank key projections and bad landmarks, but a simpler strategy improves results across models and benchmarks.
-
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
CSAttention precomputes fixed-size query-centric lookup tables in offline prefill to enable fast table-lookup decoding, delivering near-identical accuracy to full attention and up to 4.6x speedup at 95% sparsity for 3...
-
MoBA: Mixture of Block Attention for Long-Context LLMs
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
-
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...
-
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
DepthKV allocates a fixed global KV cache budget across LLM layers based on per-layer pruning sensitivity, outperforming uniform pruning at the same overall budget.
-
Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction
K2K framework enables internal memory retrieval in LLMs for healthcare outcome prediction, achieving state-of-the-art results on four benchmarks.
-
MemOS: A Memory OS for AI System
MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.
-
ScaleGANN: Accelerate Large-Scale ANN Indexing by Cost-effective Cloud GPUs
ScaleGANN accelerates graph-based ANN index construction up to 9x faster and 6x cheaper than DiskANN by using divide-and-merge on distributed low-cost spot GPUs with optimized partitioning and a cost-aware scheduler.
- RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference