arxiv: 2410.10819 · v1 · submitted 2024-10-14 · 💻 cs.CL

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han

Pith reviewed 2026-05-18 11:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords attention mechanism · KV cache optimization · long context modeling · efficient inference · retrieval heads · streaming heads · LLM acceleration

The pith

Only retrieval heads need full key-value caches for long-context processing in large language models, while streaming heads can use short fixed caches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that long-context abilities in LLMs depend primarily on a minority of attention heads that maintain full attention over all previous tokens. The remaining heads focus mostly on recent context and attention sinks, so they can operate with limited caches. By classifying heads into these two groups using an optimization procedure on synthetic examples, the DuoAttention method applies full caching only where necessary. This selective caching substantially reduces the memory footprint and lowers latency in both the pre-filling and decoding stages. The approach keeps task performance nearly identical to the original model.

Core claim

The paper claims that identifying retrieval heads, which require complete KV caches for long contexts, and streaming heads, which suffice with constant-length caches, enables efficient inference. The identification uses a lightweight optimization-based algorithm with synthetic data. This leads to memory savings of up to 2.55x for MHA models and 1.67x for GQA models, plus speedups in decoding and pre-filling, all with minimal impact on accuracy for long-context tasks.

What carries the argument

The separation of attention heads into retrieval heads that keep full KV caches and streaming heads that use a lightweight constant-length KV cache, with the split determined by an optimization algorithm on synthetic data.
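The split described above can be pictured as two cache types that live side by side in one model. The following is a minimal sketch, not the authors' implementation: the class names, the `sink` and `recent` parameters, and the string placeholders standing in for key and value tensors are all assumptions made for illustration.

```python
from collections import deque


class FullKVCache:
    """Hypothetical retrieval-head cache: keeps every (key, value) pair."""

    def __init__(self):
        self.entries = []

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


class StreamingKVCache:
    """Hypothetical streaming-head cache: first `sink` tokens plus last `recent` tokens."""

    def __init__(self, sink=4, recent=256):
        self.sink = sink
        self.sink_entries = []
        # A bounded deque evicts its oldest entry automatically once it is full.
        self.recent_entries = deque(maxlen=recent)

    def append(self, key, value):
        if len(self.sink_entries) < self.sink:
            self.sink_entries.append((key, value))
        else:
            self.recent_entries.append((key, value))

    def __len__(self):
        return len(self.sink_entries) + len(self.recent_entries)


def build_caches(is_retrieval_head, sink=4, recent=256):
    """One cache per head: full for retrieval heads, constant-length otherwise."""
    return [FullKVCache() if flag else StreamingKVCache(sink, recent)
            for flag in is_retrieval_head]


# Four heads, two of them marked as retrieval heads (an illustrative choice).
caches = build_caches([True, False, False, True], sink=2, recent=3)
for t in range(10):
    for cache in caches:
        cache.append(f"k{t}", f"v{t}")

print([len(cache) for cache in caches])  # [10, 5, 5, 10]
```

The retrieval-head caches grow with the context, while the streaming-head caches stop at `sink + recent` entries regardless of how many tokens arrive.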

If this is right

Long-context inference memory usage drops substantially, up to 2.55x for MHA models and 1.67x for GQA models.
Decoding becomes faster by up to 2.18x for MHA and 1.50x for GQA.
The pre-filling stage accelerates by up to 1.73x for MHA and 1.63x for GQA.
With quantization, Llama-3-8B can decode with a context of 3.3 million tokens on a single A100 GPU.
Long-context capabilities remain largely intact despite the reduced caching.
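To see where savings of this kind come from, a back-of-the-envelope calculation helps. The sketch below counts KV-cache bytes only and is an idealized model, not a reproduction of the paper's measurements: the function names are hypothetical, the 50% retrieval ratio and the sink and recent sizes are illustrative, and the Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) is assumed from the public model configuration rather than taken from this page.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache added per token when every head keeps a full cache."""
    # Factor 2: one key vector and one value vector per head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_reduction(context_len, retrieval_ratio, sink, recent):
    """Idealized ratio of full KV-cache size to mixed (retrieval + streaming) size."""
    # A streaming head never stores more than sink + recent tokens.
    streaming_len = min(context_len, sink + recent)
    mixed_len = (retrieval_ratio * context_len
                 + (1 - retrieval_ratio) * streaming_len)
    return context_len / mixed_len


# Assumed Llama-3-8B shape with 16-bit values (2 bytes each).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
print(per_token)  # 131072

# Illustrative setting: half the heads are retrieval heads, 1M-token context.
print(kv_cache_reduction(context_len=1_000_000, retrieval_ratio=0.5, sink=64, recent=256))
```

In this model the reduction approaches `1 / retrieval_ratio` as the context grows, because the streaming heads' share becomes negligible. Measured end-to-end figures, such as those reported above, also depend on model weights, activations, and kernel efficiency, which this sketch ignores.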

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The head classification might reveal similar structure in other transformer variants beyond the tested models.
Integrating this with other compression methods could yield further gains in efficiency.
Testing on a wider range of benchmarks would show whether the synthetic-data method generalizes across tasks.
If streaming heads prove task-dependent, online reclassification could be explored.

Load-bearing premise

The optimization algorithm using synthetic data correctly identifies which heads are retrieval heads that truly require the full KV cache to maintain long-context performance.
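This page does not detail how the optimization output is turned into a partition. One plausible reading, offered purely as an assumption, is that the procedure yields a per-head importance score and that the highest-scoring heads are kept as retrieval heads. The sketch below illustrates that final selection step only; `select_retrieval_heads`, the score dictionary, and the example values are all hypothetical.

```python
def select_retrieval_heads(gate_values, retrieval_ratio):
    """Keep the top-scoring fraction of heads as retrieval heads.

    gate_values: dict mapping (layer, head) -> importance score (assumed input).
    retrieval_ratio: fraction of heads that keep a full KV cache.
    """
    if not 0.0 <= retrieval_ratio <= 1.0:
        raise ValueError("retrieval_ratio must lie in [0, 1]")
    n_keep = round(retrieval_ratio * len(gate_values))
    # Rank heads from most to least important according to their score.
    ranked = sorted(gate_values, key=lambda head: gate_values[head], reverse=True)
    return set(ranked[:n_keep])


# Made-up scores for a 2-layer, 2-head toy model.
gates = {(0, 0): 0.97, (0, 1): 0.03, (1, 0): 0.51, (1, 1): 0.10}
retrieval = select_retrieval_heads(gates, retrieval_ratio=0.5)
print(sorted(retrieval))  # [(0, 0), (1, 0)]
```

Every head outside the returned set would be treated as a streaming head. The premise above stands or falls on whether scores like these, learned on synthetic data, rank heads the same way real long-context tasks would.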

What would settle it

Running the method on a new long-context task and observing that accuracy drops significantly when using the constant cache for the designated streaming heads would falsify the claim.

Original abstract

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks--referred to as Streaming Heads--do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided in https://github.com/mit-han-lab/duo-attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DuoAttention splits heads into retrieval and streaming types to shrink KV cache, delivering measurable speed and memory gains on long contexts, though the synthetic-data selection step could use tighter checks.

The core takeaway is that only a subset of attention heads really needs the full KV cache for long contexts, while the rest can run with a short fixed-length cache. This split lets them report up to 2.55x memory cuts and up to 2.18x decoding speedups on MHA models with little accuracy loss, and they even reach 3.3 million tokens on a single A100 after quantization. Code is out, so the numbers can be checked directly.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes DuoAttention, which classifies attention heads in LLMs into retrieval heads (requiring full KV cache to preserve long-context capabilities) and streaming heads (approximable with constant-length KV cache focused on recent tokens and attention sinks). A lightweight optimization algorithm run on synthetic data identifies the retrieval heads. The method is claimed to reduce long-context inference memory by up to 2.55x (MHA) and 1.67x (GQA), speed up decoding by up to 2.18x and 1.50x, and accelerate pre-filling by up to 1.73x and 1.63x respectively, while incurring only minimal accuracy loss versus full attention. Combined with quantization, it enables 3.3M-token context on Llama-3-8B using a single A100 GPU. Code is released.

Significance. If the synthetic-data head classification proves robust and generalizes, the work would meaningfully advance practical deployment of long-context LLMs by cutting KV-cache memory and latency without large accuracy penalties. The open-source code is a clear strength that supports reproducibility. The approach builds on existing observations about head specialization and attention sinks but its broader impact depends on whether the identified partition remains necessary and sufficient outside the reported evaluation settings.

major comments (2)

[§3] §3 (Head Identification): The optimization procedure on synthetic data is used to select retrieval heads, yet the manuscript provides no direct ablation demonstrating necessity (e.g., accuracy drop when a selected retrieval head is forced to use constant-length cache) or sufficiency (e.g., that restricting all other heads to constant cache preserves performance on the long-context benchmarks). This partition is load-bearing for the central efficiency claims.
[§4] §4 (Experiments): Performance numbers (memory reduction, speedups, accuracy) are reported only after head classification on synthetic data; there is no cross-task or cross-length hold-out validation showing the selected subset remains adequate for arbitrary long-context tasks, leaving open the possibility that the synthetic objective yields a convenient but incomplete partition.
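The necessity and sufficiency checks requested in these comments could be organized along the following lines. This is a sketch of an evaluation harness under stated assumptions, not an experiment from the paper: `ablate_heads` is a hypothetical helper, and `toy_evaluate` is a stand-in scoring function that would need to be replaced by a real long-context benchmark run.

```python
def ablate_heads(retrieval_heads, all_heads, evaluate):
    """Run sufficiency and per-head necessity checks.

    evaluate: callable taking the set of heads that keep a full KV cache
    and returning an accuracy score (all other heads use the constant cache).
    """
    retrieval_heads = set(retrieval_heads)
    full_score = evaluate(set(all_heads))   # every head keeps a full cache
    duo_score = evaluate(retrieval_heads)   # only retrieval heads keep a full cache
    necessity = {
        # Accuracy lost when this one retrieval head is demoted to streaming.
        head: duo_score - evaluate(retrieval_heads - {head})
        for head in sorted(retrieval_heads)
    }
    return {
        "full_attention": full_score,
        "duo_partition": duo_score,
        "sufficiency_gap": full_score - duo_score,
        "necessity_drop": necessity,
    }


# Toy stand-in for a benchmark: accuracy is the share of "truly needed" heads
# that still have a full cache. Purely illustrative.
ALL_HEADS = [(layer, head) for layer in range(2) for head in range(2)]
TRULY_NEEDED = {(0, 0), (1, 0)}


def toy_evaluate(full_cache_heads):
    return len(TRULY_NEEDED & set(full_cache_heads)) / len(TRULY_NEEDED)


report = ablate_heads({(0, 0), (1, 0)}, ALL_HEADS, toy_evaluate)
print(report)
```

A small `sufficiency_gap` would support the claim that streaming heads can run on a constant cache, and a large `necessity_drop` for each selected head would support the claim that the retrieval heads truly need the full cache.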

minor comments (3)

[Abstract] The abstract states 'minimal accuracy loss' without quantifying the exact delta or the specific long-context tasks/metrics used for this assessment.
[Figures] Figure captions and legends for attention-pattern visualizations could be expanded to clarify how retrieval versus streaming heads are highlighted.
[§3] Notation for the constant-length cache size hyperparameter and its relation to attention-sink handling is introduced without a dedicated equation or pseudocode block.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the validation needs for the head classification in DuoAttention. We address each major comment below and will revise the manuscript accordingly to include the suggested ablations and cross-validation experiments.

Referee: [§3] §3 (Head Identification): The optimization procedure on synthetic data is used to select retrieval heads, yet the manuscript provides no direct ablation demonstrating necessity (e.g., accuracy drop when a selected retrieval head is forced to use constant-length cache) or sufficiency (e.g., that restricting all other heads to constant cache preserves performance on the long-context benchmarks). This partition is load-bearing for the central efficiency claims.

Authors: We agree that direct ablations on necessity and sufficiency would provide stronger support for the retrieval head partition. In the revised manuscript, we will add experiments that force selected retrieval heads to use constant-length KV cache and measure the resulting accuracy drop on long-context benchmarks. We will also report results when all streaming heads are restricted to constant-length cache while retrieval heads retain full KV cache, confirming that performance is preserved. These ablations will follow the same synthetic data identification and evaluation protocol as the original results.
revision: yes

Referee: [§4] §4 (Experiments): Performance numbers (memory reduction, speedups, accuracy) are reported only after head classification on synthetic data; there is no cross-task or cross-length hold-out validation showing the selected subset remains adequate for arbitrary long-context tasks, leaving open the possibility that the synthetic objective yields a convenient but incomplete partition.

Authors: We acknowledge the need to demonstrate generalization of the identified heads. In the revision, we will include additional experiments applying the synthetic-data-selected retrieval heads to hold-out long-context tasks and context lengths not used in the optimization. We will report accuracy, memory savings, and latency improvements on these settings to show that the partition remains effective and is not limited to the synthetic objective.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in DuoAttention derivation chain

The paper's core derivation identifies retrieval heads via a lightweight optimization procedure run on synthetic data, then applies full KV cache only to that subset while restricting streaming heads to constant-length cache; long-context accuracy and efficiency metrics are measured on separate benchmark tasks after identification. This separation means the reported performance numbers do not reduce to quantities fitted on the evaluation data itself. No equations, self-citations, or uniqueness theorems are invoked that would make the partition or the efficiency gains equivalent to the inputs by construction. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach introduces a new categorization of attention heads without prior independent evidence outside this work; the identification algorithm uses optimization on synthetic data whose parameters are not detailed here.

axioms (1)

domain assumption: Attention heads in transformer LLMs can be partitioned into retrieval heads that require full long-range context and streaming heads that do not.
This partition is invoked to justify the differentiated KV cache strategy.

invented entities (2)

Retrieval Heads (no independent evidence)
purpose: Attention heads critical for long-context processing that require full KV cache
Newly postulated category based on observed attention patterns.

Streaming Heads (no independent evidence)
purpose: Attention heads focused on recent tokens and sinks that use reduced constant-length KV cache
Complementary category introduced to enable memory savings.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately."

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our method significantly reduces long-context inference memory by up to 2.55× for MHA and 1.67× for GQA models while speeding up decoding by up to 2.18× and 1.50×"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
cs.DC 2026-05 conditional novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
cs.DC 2026-04 unverdicted novelty 7.0

InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
cs.DC 2026-05 unverdicted novelty 6.0

AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
Compute Where it Counts: Self Optimizing Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
cs.LG 2026-05 unverdicted novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...
Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
cs.AI 2026-05 unverdicted novelty 6.0

SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC 2026-04 unverdicted novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG 2026-02 conditional novelty 6.0

RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
cs.CL 2025-12 unverdicted novelty 6.0

BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
cs.LG 2025-10 unverdicted novelty 6.0

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
cs.CL 2025-02 unverdicted novelty 6.0

NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
cs.CL 2024-07 accept novelty 6.0

Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...
TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
cs.CL 2026-02 unverdicted novelty 5.0

Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.
The Pitfalls of KV Cache Compression
cs.LG 2025-09 conditional novelty 5.0

KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 18 Pith papers · 13 internal anchors

[1]

Cold compress: A toolkit for benchmarking kv cache compression approaches, 8 2024

Griffin Adams, Faisal Ladhak, Hailey Schoelkopf, and Raja Biswas. Cold compress: A toolkit for benchmarking kv cache compression approaches, 8 2024. URL https://www.answer.ai/posts/2024-08-01-cold-compress.html

work page 2024
[2]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills, 2023. URL https://arxiv.org/abs/2308.16369

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

work page 2023
[4]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020. arXiv:2004.05150

work page internal anchor Pith review Pith/arXiv arXiv 2020
[7]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT - N eo X -20 B : An open-source autoregressive language model, 2022. arXiv: 2204.06745

work page internal anchor Pith review Pith/arXiv arXiv 2022
[8]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://lmsys.org/blog/2023-03-30-vicuna/

work page 2023
[9]

Generating long sequences with sparse transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. 2019

work page 2019
[10]

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? an analysis of BERT ' s attention. In Tal Linzen, Grzegorz Chrupa a, Yonatan Belinkov, and Dieuwke Hupkes (eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.\ 276--286, Florence, Italy, August 2019. ...

work page doi:10.18653/v1/w19-4828 2019
[11]

Flash A ttention-2: Faster attention with better parallelism and work partitioning, 2023

Tri Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning, 2023

work page 2023
[12]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention : Fast and memory-efficient exact attention with IO -awareness, 2022. arXiv:2205.14135

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Model tells you what to discard: Adaptive KV cache compression for LLM s

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLM s. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=uNrFpDPMyo

work page 2024
[15]

Evaluating factuality in generation with dependency-level entailment

Tanya Goyal and Greg Durrett. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020. Association for Computational Linguistics

work page 2020
[16]

Mamba: Linear-time sequence modeling with selective state spaces, 2023

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023

work page 2023
[17]

Block Sparse Attention

Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention . https://github.com/mit-han-lab/Block-Sparse-Attention, 2024

work page 2024
[18]

LM - I nfinite: Simple on-the-fly length generalization for large language models, 2023

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM - I nfinite: Simple on-the-fly length generalization for large language models, 2023

work page 2023
[19]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[20]

Flashdecoding++: Faster large language model inference on gpus, 2024

Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model inference on gpus, 2024

work page 2024
[21]

Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization, 2024

work page 2024
[22]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023. URL https://arxiv.org/abs/2309.14509

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

work page 2023
[24]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490, 2024

work page arXiv 2024
[25]

Llmtest\_needleinahaystack: Doing simple retrieval from llm models at various context lengths to measure accuracy

Greg Kamradt. Llmtest\_needleinahaystack: Doing simple retrieval from llm models at various context lengths to measure accuracy. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2024. Accessed: 2024-05-23

work page 2024
[26]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2015. URL http://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015
[27]

Booksum: A collection of datasets for long-form narrative summarization

Wojciech Kry \'s ci \'n ski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. 2021

work page 2021
[28]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

work page 2023
[29]

Video-llava: Learning united visual representation by alignment before projection, 2023

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023

work page 2023
[30]

Awq: Activation-aware weight quantization for llm compression and acceleration, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024

work page 2024
[31]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532, 2024

work page arXiv 2024
[32]

Ring attention with blockwise transformers for near-infinite context, 2023 a

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023 a

work page 2023
[33]

Visual instruction tuning, 2023 b

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023 b

work page 2023
[34]

Learning efficient convolutional networks through network slimming

Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017

work page 2017
[35]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[37]

Transformers are multi-state rnns, 2024

Matanel Oren, Michael Hassid, Yossi Adi, and Roy Schwartz. Transformers are multi-state rnns, 2024

[38]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-per...

[39]

Hyena hierarchy: Towards larger convolutional language models, 2023

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models, 2023. URL https://arxiv.org/abs/2302.10866

[40]

Chatgpt: Optimizing language models for dialogue

John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, et al. Chatgpt: Optimizing language models for dialogue. OpenAI blog, 2022

[41]

Fast transformer decoding: One write-head is all you need, 2019

Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019

[42]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

[43]

Razorattention: Efficient kv cache compression through retrieval heads, 2024a

Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient kv cache compression through retrieval heads, 2024a. URL https://arxiv.org/abs/2407.15891

[44]

Quest: Query-aware sparsity for efficient long-context llm inference, 2024b

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference, 2024b

[45]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[46]

Regression shrinkage and selection via the lasso

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58:267–288, 1996

[47]

Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, June 2023

Together. Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, June 2023. URL https://together.ai/blog/llama-2-7b-32k-instruct

[48]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a

[49]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b

[50]

Retrieval head mechanistically explains long-context factuality, 2024

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality, 2024

[51]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023a

[52]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv, 2023b

[53]

Cascade inference: Memory bandwidth efficient shared prefix batch decoding

Zihao Ye, Ruihang Lai, Roy Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, and Luis Ceze. Cascade inference: Memory bandwidth efficient shared prefix batch decoding, Jan 2024. URL https://flashinfer.ai/2024/01/08/cascade-inference.html. Accessed on 2024-02-01

[54]

Big Bird: Transformers for longer sequences

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In Proc. of NeurIPS, volume 33, 2020

[55]

Benchmarking large language models for news summarization, 2023a

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. Benchmarking large language models for news summarization, 2023a

[56]

H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023b

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023b

[57]

Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
