pith. machine review for the scientific record.

arxiv: 2406.10774 · v2 · submitted 2024-06-16 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:04 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords KV cache sparsity · long-context LLM · query-aware selection · self-attention optimization · inference acceleration · token criticality · memory-efficient inference

The pith

Quest selects only the top-K critical KV cache pages using query vectors and min-max key bounds to accelerate long-context LLM attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Quest as a query-aware method to sparsify the key-value cache during self-attention in long-context LLMs. It tracks the minimum and maximum key values within fixed-size pages of the cache and scores each page's relevance to the current query vector. Only the highest-scoring pages are loaded for the attention computation. This reduces the data movement and compute required while preserving accuracy on tasks that depend on distant context. The result is substantial speedups in both attention and overall inference latency.
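To make the mechanism concrete, here is a minimal numpy sketch of the page-selection idea as described above. The scoring rule shown (per channel, take the larger of query × page-min and query × page-max, then sum) is one natural upper bound consistent with the paper's description; function names, shapes, and the softmax details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def page_metadata(keys: np.ndarray, page_size: int):
    """Split keys (seq_len, d) into fixed-size pages and keep per-page
    element-wise min/max vectors -- the only metadata this scheme needs
    to score a page without loading its keys."""
    n_pages = -(-len(keys) // page_size)              # ceil division
    pages = [keys[i * page_size:(i + 1) * page_size] for i in range(n_pages)]
    mins = np.stack([p.min(axis=0) for p in pages])   # (n_pages, d)
    maxs = np.stack([p.max(axis=0) for p in pages])   # (n_pages, d)
    return pages, mins, maxs

def score_pages(query: np.ndarray, mins: np.ndarray, maxs: np.ndarray):
    """Upper-bound each page's maximum attention logit: per channel, the
    largest q_i * k_i over keys in the page is attained at the min or the
    max key value, depending on the sign of q_i."""
    return np.maximum(query * mins, query * maxs).sum(axis=1)

def sparse_attention(query, keys, values, page_size=16, top_k=4):
    """Attend only over the top_k pages with the highest criticality bound."""
    _, mins, maxs = page_metadata(keys, page_size)
    scores = score_pages(query, mins, maxs)
    chosen = np.sort(np.argsort(scores)[-top_k:])     # top-K critical pages
    idx = np.concatenate([np.arange(p * page_size,
                                    min((p + 1) * page_size, len(keys)))
                          for p in chosen])
    logits = keys[idx] @ query / np.sqrt(keys.shape[1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]
```

Only the min/max vectors (two d-dimensional rows per page) are touched for pages that lose the ranking, which is where the bandwidth saving would come from in a fused kernel; this sketch shows the data flow only.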

Core claim

Quest is a query-aware KV cache selection algorithm that tracks the minimal and maximal key values in KV cache pages and estimates the criticality of a given page using the query vector. By loading only the Top-K critical KV cache pages for attention, Quest achieves up to 2.23x self-attention speedup and reduces inference latency by 7.03x, with negligible accuracy loss on tasks with long dependencies.

What carries the argument

Query-aware page criticality scoring that uses per-page minimum and maximum key values to estimate attention contribution without loading the full page.

If this is right

  • Self-attention runs up to 2.23 times faster by skipping irrelevant KV pages.
  • End-to-end inference latency drops by as much as 7.03 times for long sequences.
  • Accuracy stays nearly identical on tasks that require information from distant tokens.
  • Memory bandwidth pressure during attention decreases in proportion to the fraction of pages skipped.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same page-level approximation could be adapted to other sparse attention patterns such as sliding-window or local attention.
  • Hardware with fast sparse memory access might see even larger gains than the reported software speedups.
  • Adjusting page size dynamically according to sequence length could further improve the accuracy-speed trade-off.
  • The approach might combine with existing quantization or pruning methods to compound efficiency benefits.

Load-bearing premise

Min-max key bounds per page plus query-vector scoring are enough to identify which pages truly matter for the final attention result.
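A deterministic toy example (our own construction, not the paper's) shows how this premise can fail when keys inside a page vary: a page of mutually cancelling keys receives a larger bound than the page holding the one key that actually matters.

```python
import numpy as np

q = np.array([1., 1., 1., 1.])                      # current query (d = 4)

# Page A holds the single most critical key: true max logit = 20.
page_a = np.array([[5., 5., 5., 5.],
                   [0., 0., 0., 0.]])

# Page B holds high-variance keys whose logits cancel: true max logit = 0.
page_b = np.array([[ 9., -9.,  9., -9.],
                   [-9.,  9., -9.,  9.]])

def page_bound(q, page):
    """Per-channel upper bound on the max logit over keys in the page."""
    return np.maximum(q * page.min(axis=0), q * page.max(axis=0)).sum()

for name, page in [("A", page_a), ("B", page_b)]:
    print(f"page {name}: bound = {page_bound(q, page):5.1f}, "
          f"true max logit = {(page @ q).max():5.1f}")

# page A: bound =  20.0, true max logit =  20.0
# page B: bound =  36.0, true max logit =   0.0
# Under Top-1 selection, B outranks A and the critical key is dropped.
```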

What would settle it

Measure whether accuracy on a long-dependency benchmark drops when Quest is forced to load strictly fewer than its chosen top-K pages compared with full-cache attention.
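As a toy stand-in for that experiment (random vectors in place of a real model and benchmark, so only the shape of the measurement carries over), one can chart how far the sparse attention output drifts from the full-cache result as the page budget shrinks below the chosen Top-K:

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d, page = 1024, 64, 16
keys = rng.normal(size=(seq, d))
values = rng.normal(size=(seq, d))
q = rng.normal(size=d)

def attend(idx):
    """Softmax attention restricted to the tokens in idx."""
    logits = keys[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[idx]

full = attend(np.arange(seq))

# Quest-style page bounds: per channel, max of q*min and q*max, summed.
paged = keys.reshape(-1, page, d)
bounds = np.maximum(q * paged.min(axis=1), q * paged.max(axis=1)).sum(axis=1)
order = np.argsort(bounds)[::-1]              # pages, highest bound first

for k in (64, 32, 16, 8, 4):                  # number of pages loaded
    idx = np.sort(np.concatenate(
        [np.arange(p * page, (p + 1) * page) for p in order[:k]]))
    err = np.linalg.norm(attend(idx) - full) / np.linalg.norm(full)
    print(f"top-{k:2d} pages loaded: relative output error {err:.3f}")
```

The real test would replace the relative-error metric with benchmark accuracy on a long-dependency task, compared against full-cache attention.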

Original abstract

As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens will dominate the attention outcomes. However, we observe the criticality of a token highly depends on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By only loading the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 2.23x self-attention speedup, which reduces inference latency by 7.03x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Quest, a query-aware KV cache selection algorithm for efficient long-context LLM inference. It tracks per-page min/max key values in the KV cache and uses the current query vector to estimate page criticality via an upper-bound attention logit, then loads only the top-K critical pages for the attention computation. The central empirical claim is that this yields up to 2.23× self-attention speedup and 7.03× end-to-end latency reduction while preserving accuracy on long-dependency tasks.

Significance. If the approximation reliably ranks pages without dropping critical tokens, Quest would supply a practical, training-free technique for reducing KV-cache bandwidth in long-context serving. The query-dependent scoring improves upon static sparsity heuristics and could be combined with existing paging or quantization methods.

major comments (3)
  1. [§3] §3 (Method), page-scoring procedure: the min/max key bound produces a correct but arbitrarily loose upper bound on the true max attention logit whenever keys inside a page exhibit variance. A page containing the single highest-attended token can therefore receive a lower rank than a page whose bound is inflated but whose realized scores are low. This directly threatens the claim of negligible accuracy loss on long-dependency tasks; the manuscript must quantify bound tightness (e.g., fraction of pages where the bound exceeds the true max by >20%) and report failure cases.
  2. [§4] §4 (Experiments): the reported 2.23× and 7.03× speedups are presented without hardware details, batch-size specification, number of runs, or error bars. It is therefore impossible to determine whether the selection overhead (min/max maintenance + scoring) is fully included in the latency figures or whether the gains are stable across random seeds and model scales.
  3. [Evaluation] Evaluation on long-dependency tasks: the abstract asserts “negligible accuracy loss,” yet no per-task accuracy deltas, Needle-in-Haystack retrieval curves, or ablation on Top-K and page size are visible. Because Top-K and page size are free parameters, the central claim that the method “performs well … with negligible accuracy loss” rests on unshown controls.
minor comments (2)
  1. [§3] Clarify the exact scoring formula (how min/max keys are combined with the query to produce the page score) in the main text rather than leaving it implicit from the abstract.
  2. [§2] Add a short related-work paragraph contrasting Quest with prior KV-cache eviction methods (e.g., H2O, StreamingLLM) to highlight the query-aware novelty; a toy illustration of that contrast follows this list.
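To illustrate the contrast the minor comment asks for, here is a toy numpy experiment, our own construction rather than H2O's actual algorithm: tokens are evicted once based on attention mass under early queries, and we then check how many of each later query's top tokens survive. With uncorrelated queries, most do not, which is exactly the query-dependence Quest exploits.

```python
import numpy as np

rng = np.random.default_rng(2)
seq, d, budget = 256, 32, 32
keys = rng.normal(size=(seq, d))

# One-time, query-agnostic eviction (H2O-style stand-in): keep the tokens
# with the highest attention mass under a few early queries, evict the rest.
early = rng.normal(size=(4, d))
kept = set(np.argsort(np.abs(early @ keys.T).sum(axis=0))[-budget:])

# How many of each *later* query's truly critical tokens survive eviction?
for step, q in enumerate(rng.normal(size=(4, d))):
    critical = set(np.argsort(keys @ q)[-budget:])
    print(f"late query {step}: {len(critical & kept)}/{budget} critical tokens kept")
```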

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested analyses, details, and controls.

Point-by-point responses
  1. Referee: [§3] §3 (Method), page-scoring procedure: the min/max key bound produces a correct but arbitrarily loose upper bound on the true max attention logit whenever keys inside a page exhibit variance. A page containing the single highest-attended token can therefore receive a lower rank than a page whose bound is inflated but whose realized scores are low. This directly threatens the claim of negligible accuracy loss on long-dependency tasks; the manuscript must quantify bound tightness (e.g., fraction of pages where the bound exceeds the true max by >20%) and report failure cases.

    Authors: We agree the upper bound can be loose under high intra-page key variance. In the revised manuscript we will add a dedicated analysis quantifying bound tightness: we will report the distribution of (bound - true max logit) gaps over sampled pages from our evaluation workloads and the fraction of pages where the gap exceeds 20%. We will also include concrete failure-case examples (pages ranked too low despite containing critical tokens) together with the resulting accuracy impact on the affected tasks. revision: yes

  2. Referee: [§4] §4 (Experiments): the reported 2.23× and 7.03× speedups are presented without hardware details, batch-size specification, number of runs, or error bars. It is therefore impossible to determine whether the selection overhead (min/max maintenance + scoring) is fully included in the latency figures or whether the gains are stable across random seeds and model scales.

    Authors: All reported speedups were measured on NVIDIA A100-80GB GPUs with batch size 1 (single-request serving). The latency figures fully include the overhead of per-page min/max maintenance and query-aware scoring. Results are averages over 10 independent runs; we will add error bars to all latency plots and explicitly state the model scales (Llama-2 7B/13B) and seed stability in the revision. revision: yes

  3. Referee: [Evaluation] Evaluation on long-dependency tasks: the abstract asserts “negligible accuracy loss,” yet no per-task accuracy deltas, Needle-in-Haystack retrieval curves, or ablation on Top-K and page size are visible. Because Top-K and page size are free parameters, the central claim that the method “performs well … with negligible accuracy loss” rests on unshown controls.

    Authors: We will expand the evaluation section with (i) per-task accuracy tables reporting absolute deltas versus full attention, (ii) Needle-in-Haystack retrieval accuracy curves across context lengths for multiple Top-K ratios, and (iii) ablation tables varying Top-K (10/20/50 % of pages) and page size (16/32/64 tokens). These additions will substantiate the “negligible accuracy loss” claim with the requested controls. revision: yes

Circularity Check

0 steps flagged

No circularity: Quest is an empirical heuristic with external evaluation

Full rationale

The paper presents Quest as a query-dependent KV-cache page selection algorithm that maintains per-page min/max key bounds and scores pages via dot-product upper bounds with the current query vector before selecting top-K pages. All reported speedups (2.23x attention, 7.03x end-to-end) and accuracy claims are obtained from direct wall-clock measurements and benchmark accuracy on standard long-context tasks. No equations, derivations, or self-citations are used to define the method in terms of its own outputs; the approximation is explicitly heuristic and its quality is assessed externally rather than by construction. No load-bearing step reduces to a fitted parameter, renamed known result, or author-self-citation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on two domain assumptions about attention sparsity and query dependence plus a small number of implementation hyperparameters; no new physical entities are introduced.

free parameters (2)
  • Top-K
    Number of highest-scoring pages to load; chosen to balance speed and accuracy.
  • page size
    Granularity for grouping KV entries into pages for min/max tracking; selected for implementation efficiency.
axioms (2)
  • domain assumption: A small portion of critical tokens dominates attention outcomes
    Foundation taken from previous works on KV cache sparsity.
  • domain assumption: The criticality of a token depends strongly on the query
    Central observation that enables the query-aware design.

pith-pipeline@v0.9.0 · 5515 in / 1355 out tokens · 146843 ms · 2026-05-15T14:04:56.497780+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  2. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  3. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    cs.CL 2026-04 unverdicted novelty 7.0

    TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

  4. AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...

  5. Compute Where it Counts: Self Optimizing Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...

  6. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  7. Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

    cs.CV 2026-05 unverdicted novelty 6.0

    RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

  8. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  9. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  10. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  11. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  12. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  13. Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

    cs.CR 2026-04 unverdicted novelty 6.0

    TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

  14. Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

    cs.LG 2026-04 unverdicted novelty 6.0

    Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...

  15. Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

    cs.LG 2026-04 unverdicted novelty 6.0

    Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attenti...

  16. AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

  17. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  18. IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.

  19. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  20. Comparative Characterization of KV Cache Management Strategies for LLM Inference

    cs.AR 2026-04 unverdicted novelty 3.0

    Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 19 Pith papers

  1. [1] Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, 2024.

  2. [2] Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. LongBench: A bilingual, multitask benchmark for long context understanding, 2023.

  3. [3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.

  4. [4] Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., and Gardner, M. A dataset of information-seeking questions and answers anchored in research papers, 2021.

  5. [5] Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive KV cache compression for LLMs, 2024.

  6. [9] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

  7. [10] Li, D., Shao, R., Xie, A., Sheng, Y., Zheng, L., Gonzalez, J. E., Stoica, I., Ma, X., and Zhang, H. How long can open-source LLMs truly promise on context length? https://lmsys.org/blog/2023-06-29-longchat, June 2023.

  8. [11] Liu, H., Yan, W., Zaharia, M., and Abbeel, P. World model on million-length video and language with blockwise RingAttention, 2024.

  9. [12] Liu, X., Yan, H., Zhang, S., An, C., Qiu, X., and Lin, D. Scaling laws of RoPE-based extrapolation, 2024.

  10. [13] NVIDIA. NVIDIA Ada Lovelace professional GPU architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/technologies/NVIDIA-ADA-GPU-PROVIZ-Architecture-Whitepaper_1.1.pdf, 2023.

  11. [14] NVIDIA. NVBench: NVIDIA's benchmarking tool for GPUs. https://github.com/NVIDIA/nvbench, 2024.

  12. [15] OpenAI. New models and developer products announced at DevDay. https://openai.com/blog/new-models-and-developer-products-announced-at-devday, November 2023.

  13. [16] OpenAI. Introducing GPT-4o: our fastest and most affordable flagship model. https://platform.openai.com/docs/models, 2024.

  14. [17] Oren, M., Hassid, M., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs. arXiv:2401.06104, 2024.

  15. [18] Peng, B., Quesnelle, J., Fan, H., and Shippole, E. YaRN: Efficient context window extension of large language models, 2023.

  16. [19] Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. arXiv:1911.05507, 2019.

  17. [20] Ribar, L., Chelombiev, I., Hudlass-Galley, L., Blake, C., Luschi, C., and Orr, D. SparQ attention: Bandwidth-efficient LLM inference, 2023.

  18. [21] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding, 2023.

  19. [22] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models, 2023.

  20. [23] Tworkowski, S., Staniszewski, K., Pacek, M., Wu, Y., Michalewski, H., and Miłoś, P. Focused transformer: Contrastive training for context scaling, 2023.

  21. [24] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv, 2023.

  22. [25] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering, 2018.

  23. [26] Ye, Z., Lai, R., Lu, R., Lin, C.-Y., Zheng, S., Chen, L., Chen, T., and Ceze, L. Cascade inference: Memory bandwidth efficient shared prefix batch decoding. https://flashinfer.ai/2024/01/08/cascade-inference.html, January 2024.

  24. [28] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., and Chen, B. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023.

  25. [29] Zhao, Y., Lin, C.-Y., Zhu, K., Ye, Z., Chen, L., Zheng, S., Ceze, L., Krishnamurthy, A., Chen, T., and Kasikci, B. Atom: Low-bit quantization for efficient and accurate LLM serving, 2024.

  26. [43] SmoothQuant: Accurate and efficient post-training quantization for large language models, 2023.

  27. [45] AWQ: Activation-aware weight quantization for LLM compression and acceleration, 2023.

  28. [46] LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022.

  29. [49] Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 2018. doi:10.1162/tacl_a_00023.

  30. [52] Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. doi:10.18653/v1/P17-1147.

  31. [53] Huang, L., Cao, S., Parulian, N., Ji, H., and Wang, L. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. doi:10.18653/v1/2021.naacl-main.112.

  32. [62] Zhang, J., Naruse, A., Li, X., and Wang, Y. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023. doi:10.1145/3581784.3607062.

  33. [71] GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.

  34. [72] DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024.

  35. [74] Zheng, L., Li, Z., Zhang, H., Zhuang, Y., Chen, Z., Huang, Y., Wang, Y., Xu, Y., Zhuo, D., Xing, E. P., Gonzalez, J. E., and Stoica, I. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022.

  36. [76] Guo, C., Tang, J., Hu, W., Leng, J., Zhang, C., Yang, F., Liu, Y., Guo, M., and Zhu, Y. OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization, 2023. doi:10.1145/3579371.3589038.