Recognition: no theorem link
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Pith reviewed 2026-05-15 14:04 UTC · model grok-4.3
The pith
Quest selects only the top-K critical KV cache pages using query vectors and min-max key bounds to accelerate long-context LLM attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Quest is a query-aware KV cache selection algorithm that tracks the minimal and maximal Key values in each KV cache page and estimates a page's criticality from the current Query vector. By loading only the Top-K critical KV cache pages for attention, Quest achieves up to 2.23x self-attention speedup and reduces inference latency by 7.03x, with negligible accuracy loss on tasks with long dependencies.
What carries the argument
Query-aware page criticality scoring that uses per-page minimum and maximum key values to estimate attention contribution without loading the full page.
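The scoring step can be sketched in a few lines. This is a minimal illustration, assuming a single attention head and element-wise min/max key summaries per page, consistent with the abstract's description; the function names and toy data are illustrative, not Quest's released implementation.

```python
import numpy as np

def page_criticality_scores(query, key_min, key_max):
    """Upper-bound criticality score for each KV cache page.

    query:   (d,)           current query vector
    key_min: (num_pages, d) element-wise minimum key per page
    key_max: (num_pages, d) element-wise maximum key per page

    For channel i, no key inside a page can push q_i * k_i above
    max(q_i * key_min_i, q_i * key_max_i); summing the per-channel
    maxima gives an upper bound on any attention logit from that page.
    """
    return np.maximum(query * key_min, query * key_max).sum(axis=-1)

def select_topk_pages(query, key_min, key_max, top_k):
    """Indices of the top-K critical pages (unordered)."""
    scores = page_criticality_scores(query, key_min, key_max)
    return np.argpartition(scores, -top_k)[-top_k:]

# Toy usage: 8 pages of 16 tokens each, head dim 4, keep 3 pages.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16, 4))
key_min, key_max = keys.min(axis=1), keys.max(axis=1)
query = rng.normal(size=4)
print(select_topk_pages(query, key_min, key_max, top_k=3))
```

Attention is then computed only over the tokens in the selected pages, which is where the bandwidth saving comes from; Top-K and page size remain free parameters (see the ledger below).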
If this is right
- Self-attention runs up to 2.23 times faster by skipping irrelevant KV pages.
- End-to-end inference latency drops by as much as 7.03 times for long sequences.
- Accuracy stays nearly identical on tasks that require information from distant tokens.
- Memory bandwidth pressure during attention decreases in proportion to the fraction of pages skipped.
Where Pith is reading between the lines
- The same page-level approximation could be adapted to other sparse attention patterns such as sliding-window or local attention.
- Hardware with fast sparse memory access might see even larger gains than the reported software speedups.
- Adjusting page size dynamically according to sequence length could further improve the accuracy-speed trade-off.
- The approach might combine with existing quantization or pruning methods to compound efficiency benefits.
Load-bearing premise
Min-max key bounds per page plus query-vector scoring are enough to identify which pages truly matter for the final attention result.
What would settle it
Measure whether accuracy on a long-dependency benchmark drops, relative to full-cache attention, when Quest is forced to load strictly fewer pages than its chosen top-K.
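One way to run that check, sketched under assumptions: run_benchmark stands for a hypothetical harness that returns accuracy on a long-dependency task given a per-query page budget, and full_budget means every page is loaded (full-cache attention); neither is part of the Quest codebase.

```python
# Hedged sketch of the settling experiment: sweep the page budget from
# full-cache attention down past Quest's chosen Top-K and record the
# accuracy delta at each point.
def budget_sweep(run_benchmark, chosen_top_k, full_budget):
    baseline = run_benchmark(full_budget)  # full-cache attention accuracy
    results = []
    for budget in (full_budget, chosen_top_k, chosen_top_k // 2, chosen_top_k // 4):
        accuracy = run_benchmark(budget)
        results.append((budget, accuracy, accuracy - baseline))
    return results
```

If the deltas stay near zero at the chosen Top-K and only degrade at strictly smaller budgets, the load-bearing premise above holds on that benchmark.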
read the original abstract
As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Quest, a query-aware KV cache selection algorithm for efficient long-context LLM inference. It tracks per-page min/max key values in the KV cache and uses the current query vector to estimate page criticality via an upper-bound attention logit, then loads only the top-K critical pages for the attention computation. The central empirical claim is that this yields up to 2.23× self-attention speedup and 7.03× end-to-end latency reduction while preserving accuracy on long-dependency tasks.
Significance. If the approximation reliably ranks pages without dropping critical tokens, Quest would supply a practical, training-free technique for reducing KV-cache bandwidth in long-context serving. The query-dependent scoring improves upon static sparsity heuristics and could be combined with existing paging or quantization methods.
major comments (3)
- [§3] §3 (Method), page-scoring procedure: the min/max key bound produces a correct but arbitrarily loose upper bound on the true max attention logit whenever keys inside a page exhibit variance. A page containing the single highest-attended token can therefore receive a lower rank than a page whose bound is inflated but whose realized scores are low. This directly threatens the claim of negligible accuracy loss on long-dependency tasks; the manuscript must quantify bound tightness (e.g., fraction of pages where the bound exceeds the true max by >20 %) and report failure cases.
- [§4] §4 (Experiments): the reported 2.23× and 7.03× speedups are presented without hardware details, batch-size specification, number of runs, or error bars. It is therefore impossible to determine whether the selection overhead (min/max maintenance + scoring) is fully included in the latency figures or whether the gains are stable across random seeds and model scales.
- [Evaluation] Evaluation on long-dependency tasks: the abstract asserts “negligible accuracy loss,” yet no per-task accuracy deltas, Needle-in-Haystack retrieval curves, or ablation on Top-K and page size are visible. Because Top-K and page size are free parameters, the central claim that the method “performs well … with negligible accuracy loss” rests on unshown controls.
minor comments (2)
- [§3] Clarify the exact scoring formula (how min/max keys are combined with the query to produce the page score) in the main text rather than leaving it implicit from the abstract.
- [§2] Add a short related-work paragraph contrasting Quest with prior KV-cache eviction methods (e.g., H2O, StreamingLLM) to highlight the query-aware novelty.
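On the first minor comment: one plausible form of the page score, consistent with the min/max description in the abstract, is shown below. This is a hedged reconstruction for the reader's benefit, not a quotation of the paper's equation; m_j and M_j denote the element-wise minimum and maximum keys stored for page j.

```latex
% Hedged reconstruction of a per-page criticality score S_j for query q.
% m_{j,i} and M_{j,i} are the element-wise min and max key values of page j.
\[
  S_j \;=\; \sum_{i=1}^{d} \max\bigl(q_i\, m_{j,i},\; q_i\, M_{j,i}\bigr),
\]
% Attention is then restricted to the K pages with the largest S_j.
```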
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested analyses, details, and controls.
read point-by-point responses
-
Referee: [§3] §3 (Method), page-scoring procedure: the min/max key bound produces a correct but arbitrarily loose upper bound on the true max attention logit whenever keys inside a page exhibit variance. A page containing the single highest-attended token can therefore receive a lower rank than a page whose bound is inflated but whose realized scores are low. This directly threatens the claim of negligible accuracy loss on long-dependency tasks; the manuscript must quantify bound tightness (e.g., fraction of pages where the bound exceeds the true max by >20 %) and report failure cases.
Authors: We agree the upper bound can be loose under high intra-page key variance. In the revised manuscript we will add a dedicated analysis quantifying bound tightness: we will report the distribution of (bound - true max logit) gaps over sampled pages from our evaluation workloads and the fraction of pages where the gap exceeds 20%. We will also include concrete failure-case examples (pages ranked too low despite containing critical tokens) together with the resulting accuracy impact on the affected tasks. revision: yes
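A minimal sketch of the promised bound-gap measurement, assuming offline access to the raw keys of each page; the 20% threshold mirrors the referee's example and the names are illustrative rather than taken from the Quest code.

```python
import numpy as np

def bound_gap_stats(query, keys_per_page, rel_threshold=0.20):
    """Compare the min/max upper bound with the true max logit per page.

    keys_per_page: (num_pages, page_size, d) raw keys; needed only for
    this offline tightness analysis, not at inference time.
    Returns the mean relative gap and the fraction of pages whose bound
    exceeds the true max logit by more than rel_threshold.
    """
    k_min = keys_per_page.min(axis=1)                          # (P, d)
    k_max = keys_per_page.max(axis=1)                          # (P, d)
    bound = np.maximum(query * k_min, query * k_max).sum(-1)   # (P,)
    true_max = (keys_per_page @ query).max(axis=1)             # (P,)
    rel_gap = (bound - true_max) / np.maximum(np.abs(true_max), 1e-6)
    return rel_gap.mean(), float((rel_gap > rel_threshold).mean())
```

Pages with large gaps and low realized logits are exactly the candidates for the failure-case examples the referee requests.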
-
Referee: [§4] §4 (Experiments): the reported 2.23× and 7.03× speedups are presented without hardware details, batch-size specification, number of runs, or error bars. It is therefore impossible to determine whether the selection overhead (min/max maintenance + scoring) is fully included in the latency figures or whether the gains are stable across random seeds and model scales.
Authors: All reported speedups were measured on NVIDIA A100-80GB GPUs with batch size 1 (single-request serving). The latency figures fully include the overhead of per-page min/max maintenance and query-aware scoring. Results are averages over 10 independent runs; we will add error bars to all latency plots and explicitly state the model scales (Llama-2 7B/13B) and seed stability in the revision. revision: yes
-
Referee: [Evaluation] Evaluation on long-dependency tasks: the abstract asserts “negligible accuracy loss,” yet no per-task accuracy deltas, Needle-in-Haystack retrieval curves, or ablation on Top-K and page size are visible. Because Top-K and page size are free parameters, the central claim that the method “performs well … with negligible accuracy loss” rests on unshown controls.
Authors: We will expand the evaluation section with (i) per-task accuracy tables reporting absolute deltas versus full attention, (ii) Needle-in-Haystack retrieval accuracy curves across context lengths for multiple Top-K ratios, and (iii) ablation tables varying Top-K (10/20/50 % of pages) and page size (16/32/64 tokens). These additions will substantiate the “negligible accuracy loss” claim with the requested controls. revision: yes
Circularity Check
No circularity: Quest is an empirical heuristic with external evaluation
full rationale
The paper presents Quest as a query-dependent KV-cache page selection algorithm that maintains per-page min/max key bounds and scores pages via dot-product upper bounds with the current query vector before selecting top-K pages. All reported speedups (2.23x attention, 7.03x end-to-end) and accuracy claims are obtained from direct wall-clock measurements and benchmark accuracy on standard long-context tasks. No equations, derivations, or self-citations are used to define the method in terms of its own outputs; the approximation is explicitly heuristic and its quality is assessed externally rather than by construction. No load-bearing step reduces to a fitted parameter, renamed known result, or author-self-citation chain.
Axiom & Free-Parameter Ledger
free parameters (2)
- Top-K
- page size
axioms (2)
- domain assumption: A small portion of critical tokens dominates attention outcomes
- domain assumption: The criticality of a token depends highly on the query
Forward citations
Cited by 20 Pith papers
-
Very Efficient Listwise Multimodal Reranking for Long Documents
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
-
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
-
Compute Where it Counts: Self Optimizing Language Models
SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
-
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
-
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...
-
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
-
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.
-
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...
-
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...
-
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
-
Graph-Guided Adaptive Channel Elimination for KV Cache Compression
GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
-
IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
Comparative Characterization of KV Cache Management Strategies for LLM Inference
Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.
Reference graph
Works this paper leans on
-
[1]
Introducing the next generation of Claude
Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024. [Accessed 28-05-2024]
work page 2024
-
[2]
Longbench: A bilingual, multitask benchmark for long context understanding, 2023
Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. Longbench: A bilingual, multitask benchmark for long context understanding, 2023
work page 2023
-
[3]
Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022
work page 2022
-
[4]
Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., and Gardner, M. A dataset of information-seeking questions and answers anchored in research papers, 2021
work page 2021
-
[5]
Model tells you what to discard: Adaptive kv cache compression for llms, 2024
Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive kv cache compression for llms, 2024
work page 2024
-
[9]
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[10]
How long can open-source llms truly promise on context length?, June 2023
Li, D., Shao, R., Xie, A., Sheng, Y., Zheng, L., Gonzalez, J. E., Stoica, I., Ma, X., and Zhang, H. How long can open-source llms truly promise on context length?, June 2023. URL https://lmsys.org/blog/2023-06-29-longchat
work page 2023
-
[11]
World model on million-length video and language with blockwise ringattention, 2024a
Liu, H., Yan, W., Zaharia, M., and Abbeel, P. World model on million-length video and language with blockwise ringattention, 2024a
work page 2024
-
[12]
Scaling laws of rope-based extrapolation, 2024b
Liu, X., Yan, H., Zhang, S., An, C., Qiu, X., and Lin, D. Scaling laws of rope-based extrapolation, 2024b
work page 2024
-
[13]
Nvidia ada lovelace professional gpu architecture
NVIDIA. Nvidia ada lovelace professional gpu architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/technologies/NVIDIA-ADA-GPU-PROVIZ-Architecture-Whitepaper_1.1.pdf, 2023. [Accessed 28-05-2024]
work page 2023
-
[14]
Nvbench: Nvidia's benchmarking tool for gpus, 2024
NVIDIA. Nvbench: Nvidia's benchmarking tool for gpus, 2024. Available online: https://github.com/NVIDIA/nvbench
work page 2024
-
[15]
New models and developer products announced at devday
OpenAI. New models and developer products announced at devday. https://openai.com/blog/new-models-and-developer-products-announced-at-devday#OpenAI, November 2023. Accessed: 2024-01-31
work page 2023
-
[16]
Introducing gpt-4o: our fastest and most affordable flagship model
OpenAI. Introducing gpt-4o: our fastest and most affordable flagship model. https://platform.openai.com/docs/models, 2024. [Accessed 28-05-2024]
work page 2024
-
[17]
Transformers are multi-state RNNs, 2024
Oren, M., Hassid, M., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs, 2024. URL https://arxiv.org/abs/2401.06104. arXiv:2401.06104
-
[18]
Yarn: Efficient context window extension of large language models, 2023
Peng, B., Quesnelle, J., Fan, H., and Shippole, E. Yarn: Efficient context window extension of large language models, 2023
work page 2023
-
[19]
Compressive transformers for long-range sequence modelling, 2019
Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. arXiv preprint, 2019. URL https://arxiv.org/abs/1911.05507
-
[20]
Sparq attention: Bandwidth-efficient llm inference, 2023
Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., and Orr, D. Sparq attention: Bandwidth-efficient llm inference, 2023
work page 2023
-
[21]
Roformer: Enhanced transformer with rotary position embedding, 2023
Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding, 2023
work page 2023
-
[22]
Llama: Open and efficient foundation language models, 2023
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models, 2023
work page 2023
-
[23]
Focused transformer: Contrastive training for context scaling, 2023
Tworkowski, S., Staniszewski, K., Pacek, M., Wu, Y., Michalewski, H., and Miłoś, P. Focused transformer: Contrastive training for context scaling, 2023
work page 2023
-
[24]
Efficient streaming language models with attention sinks
Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv, 2023
work page 2023
-
[25]
Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018
work page 2018
-
[26]
Cascade inference: Memory bandwidth efficient shared prefix batch decoding
Ye, Z., Lai, R., Lu, R., Lin, C.-Y., Zheng, S., Chen, L., Chen, T., and Ceze, L. Cascade inference: Memory bandwidth efficient shared prefix batch decoding. https://flashinfer.ai/2024/01/08/cascade-inference.html, Jan 2024. URL https://flashinfer.ai/2024/01/08/cascade-inference.html. Accessed on 2024-02-01
work page 2024
-
[28]
H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023b
Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., and Chen, B. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023b
work page 2023
-
[29]
Atom: Low-bit quantization for efficient and accurate llm serving, 2024
Zhao, Y., Lin, C.-Y., Zhu, K., Ye, Z., Chen, L., Zheng, S., Ceze, L., Krishnamurthy, A., Chen, T., and Kasikci, B. Atom: Low-bit quantization for efficient and accurate llm serving, 2024
work page 2024
-
[30]
P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000), 2000
work page 2000
-
[31]
T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980
work page 1980
-
[32]
M. J. Kearns
-
[33]
Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983
work page 1983
-
[34]
R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000
work page 2000
-
[35]
Suppressed for Anonymity
-
[36]
A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981
work page 1981
-
[37]
A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959
work page 1959
-
[38]
YaRN: Efficient Context Window Extension of Large Language Models, 2023
work page 2023
-
[39]
Focused Transformer: Contrastive Training for Context Scaling, 2023
work page 2023
- [40]
-
[41]
LLaMA: Open and Efficient Foundation Language Models, 2023
work page 2023
-
[42]
Efficient Memory Management for Large Language Model Serving with PagedAttention, 2023
Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
-
[43]
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2023
work page 2023
-
[44]
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving, 2024
work page 2024
-
[45]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2023
work page 2023
-
[46]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022
work page 2022
-
[47]
Rae, Jack W., Potapenko, Anna, Jayakumar, Siddhant M., Hillier, Chloe, and Lillicrap, Timothy P. Compressive Transformers for Long-Range Sequence Modelling. arXiv preprint, 2019
-
[48]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding, 2023
work page 2023
-
[49]
The NarrativeQA Reading Comprehension Challenge
Kočiský, Tomáš, Schwarz, Jonathan, Blunsom, Phil, Dyer, Chris, Hermann, Karl Moritz, Melis, Gábor, and Grefenstette, Edward. The NarrativeQA Reading Comprehension Challenge. Transactions of the Association for Computational Linguistics, 2018. doi:10.1162/tacl_a_00023
-
[50]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018
work page 2018
-
[51]
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers, 2021
work page 2021
-
[52]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Joshi, Mandar, Choi, Eunsol, Weld, Daniel, and Zettlemoyer, Luke. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147
-
[53]
Efficient Attentions for Long Document Summarization
Huang, Luyang and Cao, Shuyang and Parulian, Nikolaus and Ji, Heng and Wang, Lu. Efficient Attentions for Long Document Summarization. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.112
-
[54]
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, 2023
work page 2023
-
[55]
Oren, Matanel, Hassid, Michael, Adi, Yossi, and Schwartz, Roy. Transformers are Multi-State RNNs, 2024
-
[56]
Efficient Streaming Language Models with Attention Sinks. arXiv, 2023
-
[57]
How Long Can Open-Source LLMs Truly Promise on Context Length?, June 2023
Li, Dacheng, Shao, Rulin, Xie, Anze, Sheng, Ying, Zheng, Lianmin, Gonzalez, Joseph E., Stoica, Ion, Ma, Xuezhe, and Zhang, Hao. How Long Can Open-Source LLMs Truly Promise on Context Length?, June 2023
-
[58]
Ye, Zihao, Lai, Ruihang, Lu, Roy, Lin, Chien-Yu, Zheng, Size, Chen, Lequn, Chen, Tianqi, and Ceze, Luis. 2024
work page 2024
-
[59]
RoFormer: Enhanced Transformer with Rotary Position Embedding, 2023
work page 2023
-
[60]
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, 2024
work page 2024
-
[61]
SparQ Attention: Bandwidth-Efficient LLM Inference, 2023
work page 2023
-
[62]
Zhang, Jingrong, Naruse, Akira, Li, Xipeng, and Wang, Yong. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023. doi:10.1145/3581784.3607062
-
[63]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022
work page 2022
- [64]
-
[65]
GitHub repository
Frantar, Elias and Alistarh, Dan. GitHub repository, 2024
work page 2024
-
[66]
World Model on Million-Length Video And Language With Blockwise RingAttention, 2024
work page 2024
- [67]
- [68]
- [69]
- [70]
-
[71]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, 2023
work page 2023
-
[72]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, 2024
work page 2024
-
[73]
Thakkar, Vijay, Ramani, Pradeep, Cecka, Cris, Shivam, Aniket, Lu, Honghao, Yan, Ethan, Kosaian, Jack, Hoemmen, Mark, Wu, Haicheng, Kerr, Andrew, Nicely, Matt, Merrill, Duane, Blasig, Dustyn, Qiao, Fengqi, Majcher, Piotr, Springer, Paul, Hohnerbach, Markus, Wang, Jin, and Gupta, Manish
-
[74]
Zheng, Lianmin, Li, Zhuohan, Zhang, Hao, Zhuang, Yonghao, Chen, Zhifeng, Huang, Yanping, Wang, Yida, Xu, Yuanzhong, Zhuo, Danyang, Xing, Eric P., Gonzalez, Joseph E., and Stoica, Ion. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022
-
[75]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2024
work page 2024
-
[76]
Guo, Cong, Tang, Jiaming, Hu, Weiming, Leng, Jingwen, Zhang, Chen, Yang, Fan, Liu, Yunxin, Guo, Minyi, and Zhu, Yuhao. OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization, 2023. doi:10.1145/3579371.3589038