pith. machine review for the scientific record.

arxiv: 2406.10774 · v2 · submitted 2024-06-16 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:04 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords KV cache sparsity · long-context LLM · query-aware selection · self-attention optimization · inference acceleration · token criticality · memory-efficient inference

The pith

Quest selects only the top-K critical KV cache pages using query vectors and min-max key bounds to accelerate long-context LLM attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Quest as a query-aware method to sparsify the key-value cache during self-attention in long-context LLMs. It tracks the minimum and maximum key values within fixed-size pages of the cache and scores each page's relevance to the current query vector. Only the highest-scoring pages are loaded for the attention computation. This reduces the data movement and compute required while preserving accuracy on tasks that depend on distant context. The result is substantial speedups in both attention and overall inference latency.
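To make the mechanism concrete, here is a minimal numpy sketch of the page-selection idea as described above. The scoring rule shown (per channel, take the larger of query × page-min and query × page-max, then sum) is one natural upper bound consistent with the paper's description; function names, shapes, and the softmax details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def page_metadata(keys: np.ndarray, page_size: int):
    """Split keys (seq_len, d) into fixed-size pages and keep per-page
    element-wise min/max vectors -- the only metadata this scheme needs
    to score a page without loading its keys."""
    n_pages = -(-len(keys) // page_size)              # ceil division
    pages = [keys[i * page_size:(i + 1) * page_size] for i in range(n_pages)]
    mins = np.stack([p.min(axis=0) for p in pages])   # (n_pages, d)
    maxs = np.stack([p.max(axis=0) for p in pages])   # (n_pages, d)
    return pages, mins, maxs

def score_pages(query: np.ndarray, mins: np.ndarray, maxs: np.ndarray):
    """Upper-bound each page's maximum attention logit: per channel, the
    largest q_i * k_i over keys in the page is attained at the min or the
    max key value, depending on the sign of q_i."""
    return np.maximum(query * mins, query * maxs).sum(axis=1)

def sparse_attention(query, keys, values, page_size=16, top_k=4):
    """Attend only over the top_k pages with the highest criticality bound."""
    _, mins, maxs = page_metadata(keys, page_size)
    scores = score_pages(query, mins, maxs)
    chosen = np.sort(np.argsort(scores)[-top_k:])     # top-K critical pages
    idx = np.concatenate([np.arange(p * page_size,
                                    min((p + 1) * page_size, len(keys)))
                          for p in chosen])
    logits = keys[idx] @ query / np.sqrt(keys.shape[1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]
```

Only the min/max vectors (two d-dimensional rows per page) are touched for pages that lose the ranking, which is where the bandwidth saving would come from in a fused kernel; this sketch shows the data flow only.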

Core claim

Quest is a query-aware KV cache selection algorithm that tracks the minimal and maximal key values in KV cache pages and estimates the criticality of a given page using the query vector. By loading only the Top-K critical KV cache pages for attention, Quest achieves up to 2.23x self-attention speedup and reduces inference latency by 7.03x, with negligible accuracy loss on tasks with long dependencies.

What carries the argument

Query-aware page criticality scoring that uses per-page minimum and maximum key values to estimate attention contribution without loading the full page.

If this is right

  • Self-attention runs up to 2.23 times faster by skipping irrelevant KV pages.
  • End-to-end inference latency drops by as much as 7.03 times for long sequences.
  • Accuracy stays nearly identical on tasks that require information from distant tokens.
  • Memory bandwidth pressure during attention decreases in proportion to the fraction of pages skipped.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same page-level approximation could be adapted to other sparse attention patterns such as sliding-window or local attention.
  • Hardware with fast sparse memory access might see even larger gains than the reported software speedups.
  • Adjusting page size dynamically according to sequence length could further improve the accuracy-speed trade-off.
  • The approach might combine with existing quantization or pruning methods to compound efficiency benefits.

Load-bearing premise

Min-max key bounds per page plus query-vector scoring are enough to identify which pages truly matter for the final attention result.
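A deterministic toy example (our own construction, not the paper's) shows how this premise can fail when keys inside a page vary: a page of mutually cancelling keys receives a larger bound than the page holding the one key that actually matters.

```python
import numpy as np

q = np.array([1., 1., 1., 1.])                      # current query (d = 4)

# Page A holds the single most critical key: true max logit = 20.
page_a = np.array([[5., 5., 5., 5.],
                   [0., 0., 0., 0.]])

# Page B holds high-variance keys whose logits cancel: true max logit = 0.
page_b = np.array([[ 9., -9.,  9., -9.],
                   [-9.,  9., -9.,  9.]])

def page_bound(q, page):
    """Per-channel upper bound on the max logit over keys in the page."""
    return np.maximum(q * page.min(axis=0), q * page.max(axis=0)).sum()

for name, page in [("A", page_a), ("B", page_b)]:
    print(f"page {name}: bound = {page_bound(q, page):5.1f}, "
          f"true max logit = {(page @ q).max():5.1f}")

# page A: bound =  20.0, true max logit =  20.0
# page B: bound =  36.0, true max logit =   0.0
# Under Top-1 selection, B outranks A and the critical key is dropped.
```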

What would settle it

Measure whether accuracy on a long-dependency benchmark drops when Quest is forced to load strictly fewer than its chosen top-K pages compared with full-cache attention.
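As a toy stand-in for that experiment (random vectors in place of a real model and benchmark, so only the shape of the measurement carries over), one can chart how far the sparse attention output drifts from the full-cache result as the page budget shrinks below the chosen Top-K:

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d, page = 1024, 64, 16
keys = rng.normal(size=(seq, d))
values = rng.normal(size=(seq, d))
q = rng.normal(size=d)

def attend(idx):
    """Softmax attention restricted to the tokens in idx."""
    logits = keys[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[idx]

full = attend(np.arange(seq))

# Quest-style page bounds: per channel, max of q*min and q*max, summed.
paged = keys.reshape(-1, page, d)
bounds = np.maximum(q * paged.min(axis=1), q * paged.max(axis=1)).sum(axis=1)
order = np.argsort(bounds)[::-1]              # pages, highest bound first

for k in (64, 32, 16, 8, 4):                  # number of pages loaded
    idx = np.sort(np.concatenate(
        [np.arange(p * page, (p + 1) * page) for p in order[:k]]))
    err = np.linalg.norm(attend(idx) - full) / np.linalg.norm(full)
    print(f"top-{k:2d} pages loaded: relative output error {err:.3f}")
```

The real test would replace the relative-error metric with benchmark accuracy on a long-dependency task, compared against full-cache attention.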

Original abstract

As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Quest, a query-aware KV cache selection algorithm for efficient long-context LLM inference. It tracks per-page min/max key values in the KV cache and uses the current query vector to estimate page criticality via an upper-bound attention logit, then loads only the top-K critical pages for the attention computation. The central empirical claim is that this yields up to 2.23× self-attention speedup and 7.03× end-to-end latency reduction while preserving accuracy on long-dependency tasks.

Significance. If the approximation reliably ranks pages without dropping critical tokens, Quest would supply a practical, training-free technique for reducing KV-cache bandwidth in long-context serving. The query-dependent scoring improves upon static sparsity heuristics and could be combined with existing paging or quantization methods.

major comments (3)
  1. [§3] §3 (Method), page-scoring procedure: the min/max key bound produces a correct but arbitrarily loose upper bound on the true max attention logit whenever keys inside a page exhibit variance. A page containing the single highest-attended token can therefore receive a lower rank than a page whose bound is inflated but whose realized scores are low. This directly threatens the claim of negligible accuracy loss on long-dependency tasks; the manuscript must quantify bound tightness (e.g., fraction of pages where the bound exceeds the true max by >20%) and report failure cases.
  2. [§4] §4 (Experiments): the reported 2.23× and 7.03× speedups are presented without hardware details, batch-size specification, number of runs, or error bars. It is therefore impossible to determine whether the selection overhead (min/max maintenance + scoring) is fully included in the latency figures or whether the gains are stable across random seeds and model scales.
  3. [Evaluation] Evaluation on long-dependency tasks: the abstract asserts “negligible accuracy loss,” yet no per-task accuracy deltas, Needle-in-Haystack retrieval curves, or ablation on Top-K and page size are visible. Because Top-K and page size are free parameters, the central claim that the method “performs well … with negligible accuracy loss” rests on unshown controls.
minor comments (2)
  1. [§3] Clarify the exact scoring formula (how min/max keys are combined with the query to produce the page score) in the main text rather than leaving it implicit from the abstract.
  2. [§2] Add a short related-work paragraph contrasting Quest with prior KV-cache eviction methods (e.g., H2O, StreamingLLM) to highlight the query-aware novelty; a toy illustration of that contrast follows this list.
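To illustrate the contrast the minor comment asks for, here is a toy numpy experiment, our own construction rather than H2O's actual algorithm: tokens are evicted once based on attention mass under early queries, and we then check how many of each later query's top tokens survive. With uncorrelated queries, most do not, which is exactly the query-dependence Quest exploits.

```python
import numpy as np

rng = np.random.default_rng(2)
seq, d, budget = 256, 32, 32
keys = rng.normal(size=(seq, d))

# One-time, query-agnostic eviction (H2O-style stand-in): keep the tokens
# with the highest attention mass under a few early queries, evict the rest.
early = rng.normal(size=(4, d))
kept = set(np.argsort(np.abs(early @ keys.T).sum(axis=0))[-budget:])

# How many of each *later* query's truly critical tokens survive eviction?
for step, q in enumerate(rng.normal(size=(4, d))):
    critical = set(np.argsort(keys @ q)[-budget:])
    print(f"late query {step}: {len(critical & kept)}/{budget} critical tokens kept")
```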

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested analyses, details, and controls.

Point-by-point responses
  1. Referee: [§3] §3 (Method), page-scoring procedure: the min/max key bound produces a correct but arbitrarily loose upper bound on the true max attention logit whenever keys inside a page exhibit variance. A page containing the single highest-attended token can therefore receive a lower rank than a page whose bound is inflated but whose realized scores are low. This directly threatens the claim of negligible accuracy loss on long-dependency tasks; the manuscript must quantify bound tightness (e.g., fraction of pages where the bound exceeds the true max by >20%) and report failure cases.

    Authors: We agree the upper bound can be loose under high intra-page key variance. In the revised manuscript we will add a dedicated analysis quantifying bound tightness: we will report the distribution of (bound - true max logit) gaps over sampled pages from our evaluation workloads and the fraction of pages where the gap exceeds 20%. We will also include concrete failure-case examples (pages ranked too low despite containing critical tokens) together with the resulting accuracy impact on the affected tasks. revision: yes

  2. Referee: [§4] §4 (Experiments): the reported 2.23× and 7.03× speedups are presented without hardware details, batch-size specification, number of runs, or error bars. It is therefore impossible to determine whether the selection overhead (min/max maintenance + scoring) is fully included in the latency figures or whether the gains are stable across random seeds and model scales.

    Authors: All reported speedups were measured on NVIDIA A100-80GB GPUs with batch size 1 (single-request serving). The latency figures fully include the overhead of per-page min/max maintenance and query-aware scoring. Results are averages over 10 independent runs; we will add error bars to all latency plots and explicitly state the model scales (Llama-2 7B/13B) and seed stability in the revision. revision: yes

  3. Referee: [Evaluation] Evaluation on long-dependency tasks: the abstract asserts “negligible accuracy loss,” yet no per-task accuracy deltas, Needle-in-Haystack retrieval curves, or ablation on Top-K and page size are visible. Because Top-K and page size are free parameters, the central claim that the method “performs well … with negligible accuracy loss” rests on unshown controls.

    Authors: We will expand the evaluation section with (i) per-task accuracy tables reporting absolute deltas versus full attention, (ii) Needle-in-Haystack retrieval accuracy curves across context lengths for multiple Top-K ratios, and (iii) ablation tables varying Top-K (10/20/50 % of pages) and page size (16/32/64 tokens). These additions will substantiate the “negligible accuracy loss” claim with the requested controls. revision: yes

Circularity Check

0 steps flagged

No circularity: Quest is an empirical heuristic with external evaluation

Full rationale

The paper presents Quest as a query-dependent KV-cache page selection algorithm that maintains per-page min/max key bounds and scores pages via dot-product upper bounds with the current query vector before selecting top-K pages. All reported speedups (2.23x attention, 7.03x end-to-end) and accuracy claims are obtained from direct wall-clock measurements and benchmark accuracy on standard long-context tasks. No equations, derivations, or self-citations are used to define the method in terms of its own outputs; the approximation is explicitly heuristic and its quality is assessed externally rather than by construction. No load-bearing step reduces to a fitted parameter, renamed known result, or author-self-citation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions about attention sparsity and query dependence plus a small number of implementation hyperparameters; no new physical entities are introduced.

free parameters (2)
  • Top-K
    Number of highest-scoring pages to load; chosen to balance speed and accuracy.
  • page size
    Granularity for grouping KV entries into pages for min/max tracking; selected for implementation efficiency.
axioms (2)
  • domain assumption: A small portion of critical tokens dominates attention outcomes
    Foundation taken from previous works on KV cache sparsity.
  • domain assumption: The criticality of a token depends strongly on the query
    Central observation that enables the query-aware design.

pith-pipeline@v0.9.0 · 5515 in / 1355 out tokens · 146843 ms · 2026-05-15T14:04:56.497780+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  2. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  3. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    cs.CL 2026-04 unverdicted novelty 7.0

    TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

  4. AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...

  5. Compute Where it Counts: Self Optimizing Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...

  6. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  7. Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

    cs.CV 2026-05 unverdicted novelty 6.0

    RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

  8. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  9. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  10. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  11. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  12. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  13. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

    cs.CR 2026-04 unverdicted novelty 6.0

    TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

  14. Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

    cs.LG 2026-04 unverdicted novelty 6.0

    Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...

  15. Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

    cs.LG 2026-04 unverdicted novelty 6.0

    Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...

  16. AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

  17. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  18. IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.

  19. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  20. Comparative Characterization of KV Cache Management Strategies for LLM Inference

    cs.AR 2026-04 unverdicted novelty 3.0

    Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 19 Pith papers

  1. [1] Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024.

  2. [2] Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. LongBench: A bilingual, multitask benchmark for long context understanding, 2023.

  3. [3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.

  4. [4] Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., and Gardner, M. A dataset of information-seeking questions and answers anchored in research papers, 2021.

  5. [5] Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive KV cache compression for LLMs, 2024.

  6. [9] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

  7. [10] Li, D., Shao, R., Xie, A., Sheng, Y., Zheng, L., Gonzalez, J. E., Stoica, I., Ma, X., and Zhang, H. How long can open-source LLMs truly promise on context length? https://lmsys.org/blog/2023-06-29-longchat, June 2023.

  8. [11] Liu, H., Yan, W., Zaharia, M., and Abbeel, P. World model on million-length video and language with blockwise RingAttention, 2024.

  9. [12] Liu, X., Yan, H., Zhang, S., An, C., Qiu, X., and Lin, D. Scaling laws of RoPE-based extrapolation, 2024.

  10. [13] NVIDIA. NVIDIA Ada Lovelace professional GPU architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/technologies/NVIDIA-ADA-GPU-PROVIZ-Architecture-Whitepaper_1.1.pdf, 2023.

  11. [14] NVIDIA. NVBench: NVIDIA's benchmarking tool for GPUs. https://github.com/NVIDIA/nvbench, 2024.

  12. [15] OpenAI. New models and developer products announced at DevDay. https://openai.com/blog/new-models-and-developer-products-announced-at-devday, November 2023.

  13. [16] OpenAI. Introducing GPT-4o: our fastest and most affordable flagship model. https://platform.openai.com/docs/models, 2024.

  14. [17] Oren, M., Hassid, M., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs. arXiv:2401.06104, 2024.

  15. [18] Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models, 2023.

  16. [19] Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. arXiv:1911.05507, 2019.

  17. [20] Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., and Orr, D. SparQ attention: Bandwidth-efficient LLM inference, 2023.

  18. [21] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding, 2023.

  19. [22] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models, 2023.

  20. [23] Tworkowski, S., Staniszewski, K., Pacek, M., Wu, Y., Michalewski, H., and Miłoś, P. Focused transformer: Contrastive training for context scaling, 2023.

  21. [24] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv, 2023.

  22. [25] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering, 2018.

  23. [26] Ye, Z., Lai, R., Lu, R., Lin, C.-Y., Zheng, S., Chen, L., Chen, T., and Ceze, L. Cascade inference: Memory bandwidth efficient shared prefix batch decoding. https://flashinfer.ai/2024/01/08/cascade-inference.html, January 2024.

  24. [28] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., and Chen, B. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023.

  25. [29] Zhao, Y., Lin, C.-Y., Zhu, K., Ye, Z., Chen, L., Zheng, S., Ceze, L., Krishnamurthy, A., Chen, T., and Kasikci, B. Atom: Low-bit quantization for efficient and accurate LLM serving, 2024.

  26. [43] SmoothQuant: Accurate and efficient post-training quantization for large language models, 2023.

  27. [45] AWQ: Activation-aware weight quantization for LLM compression and acceleration, 2023.

  28. [46] LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022.

  29. [49] Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 2018. doi:10.1162/tacl_a_00023.

  30. [52] Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147.

  31. [53] Huang, L., Cao, S., Parulian, N., Ji, H., and Wang, L. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. doi:10.18653/v1/2021.naacl-main.112.

  32. [62] Zhang, J., Naruse, A., Li, X., and Wang, Y. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023. doi:10.1145/3581784.3607062.

  33. [71] GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.

  34. [72] DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.

  35. [74] Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., Wang, Y., Xu, Y., Zhuo, D., Xing, E. P., Gonzalez, J. E., and Stoica, I. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022.

  36. [76] Guo, C., Tang, J., Hu, W., Leng, J., Zhang, C., Yang, F., Liu, Y., Guo, M., and Zhu, Y. OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization, 2023. doi:10.1145/3579371.3589038.