DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Pith reviewed 2026-05-18 11:44 UTC · model grok-4.3
The pith
Only retrieval heads need full key-value caches for long-context processing in large language models, while streaming heads can use short fixed caches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that efficient long-context inference follows from separating retrieval heads, which require complete KV caches, from streaming heads, for which a constant-length cache suffices. A lightweight optimization-based algorithm run on synthetic data performs the identification. The reported result is memory savings of up to 2.55x for MHA models and 1.67x for GQA models, plus faster decoding and pre-filling, with minimal accuracy loss on long-context tasks.
What carries the argument
The separation of attention heads into retrieval heads that keep full KV caches and streaming heads that use a lightweight constant-length KV cache, with the split determined by an optimization algorithm on synthetic data.
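The two cache policies described above can be illustrated with a minimal sketch. It is not the paper's implementation: the class names, the `num_sink` and `num_recent` parameters, and the `(position, payload)` stand-in for key/value tensors are all hypothetical, chosen only to show how a full cache grows with the context while a sink-plus-recent cache stays at a fixed length.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical stand-in for one token's key/value entry: (position, payload).
KV = Tuple[int, str]


class FullCache:
    """Retrieval-head policy: keep an entry for every token seen so far."""

    def __init__(self) -> None:
        self.entries: List[KV] = []

    def append(self, kv: KV) -> None:
        self.entries.append(kv)

    def view(self) -> List[KV]:
        return list(self.entries)


class StreamingCache:
    """Streaming-head policy: keep the first `num_sink` tokens (attention
    sinks) plus the most recent `num_recent` tokens, and nothing else."""

    def __init__(self, num_sink: int, num_recent: int) -> None:
        self.num_sink = num_sink
        self.sinks: List[KV] = []
        # A deque with maxlen drops its oldest entry once it is full.
        self.recent: Deque[KV] = deque(maxlen=num_recent)

    def append(self, kv: KV) -> None:
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def view(self) -> List[KV]:
        return self.sinks + list(self.recent)


full, stream = FullCache(), StreamingCache(num_sink=2, num_recent=3)
for pos in range(10):
    full.append((pos, f"kv{pos}"))
    stream.append((pos, f"kv{pos}"))

full_positions = [p for p, _ in full.view()]
stream_positions = [p for p, _ in stream.view()]
print(full_positions)
print(stream_positions)
```

In this toy run the full cache holds all ten positions, while the streaming cache holds only the two sink positions and the last three, regardless of how many tokens were appended.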
If this is right
- Long-context inference memory usage drops substantially, up to 2.55x for MHA models and 1.67x for GQA models.
- Decoding becomes faster by up to 2.18x for MHA and 1.50x for GQA.
- Pre-filling accelerates by up to 1.73x for MHA and 1.63x for GQA models.
- Combined with quantization, Llama-3-8B can decode with a 3.3 million token context on a single A100 GPU.
- Long-context capabilities remain largely intact despite the reduced caching.
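The arithmetic behind the memory claim can be sketched as a back-of-the-envelope count of cached token slots. The function names, the 25% retrieval-head fraction, and the sink and recent window sizes below are illustrative assumptions, not settings taken from the paper, and the count covers only the KV cache (not weights or activations), so it should not be expected to reproduce the reported 2.55x or 1.67x figures.

```python
def kv_cache_slots(context_len, num_heads, retrieval_fraction, num_sink, num_recent):
    """Count cached token slots (one slot = one token in one head).

    Returns (full_attention_slots, split_cache_slots). All parameters are
    hypothetical knobs for illustration.
    """
    full_slots = num_heads * context_len
    retrieval_heads = round(num_heads * retrieval_fraction)
    streaming_heads = num_heads - retrieval_heads
    # A streaming head never stores more than sinks + recent window.
    streaming_len = min(context_len, num_sink + num_recent)
    split_slots = retrieval_heads * context_len + streaming_heads * streaming_len
    return full_slots, split_slots


def reduction_factor(context_len, num_heads, retrieval_fraction, num_sink, num_recent):
    """Ratio of full-attention KV slots to split-cache KV slots."""
    full_slots, split_slots = kv_cache_slots(
        context_len, num_heads, retrieval_fraction, num_sink, num_recent
    )
    return full_slots / split_slots


long_ratio = reduction_factor(100_000, 32, 0.25, num_sink=64, num_recent=256)
short_ratio = reduction_factor(100, 32, 0.25, num_sink=64, num_recent=256)
print(f"long context:  {long_ratio:.2f}x")
print(f"short context: {short_ratio:.2f}x")
```

Under these assumptions the ratio approaches the inverse of the retrieval-head fraction as the context grows, and it collapses to 1 when the context is shorter than the streaming window, which is why the savings are a long-context effect.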
Where Pith is reading between the lines
- The head classification might reveal similar structure in other transformer variants beyond the tested models.
- Integrating this with other compression methods could yield further gains in efficiency.
- Testing on a wider range of benchmarks would confirm if the synthetic data method generalizes across tasks.
- If streaming heads prove task-dependent, online reclassification could be explored.
Load-bearing premise
The optimization algorithm using synthetic data correctly identifies which heads are retrieval heads that truly require the full KV cache to maintain long-context performance.
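To make the premise concrete, here is a toy sketch of one way an optimization-based head classifier could work. The source only says the algorithm is lightweight, optimization-based, and run on synthetic data; everything else here is an assumption for illustration: the per-head gate in [0, 1], the L1 penalty weight `lam`, the projected gradient descent, the 0.5 threshold, and the scalar per-sample head outputs.

```python
def identify_retrieval_heads(full_out, stream_out, lam=0.05, lr=0.1,
                             steps=500, threshold=0.5):
    """Toy gate optimization (an assumed formulation, not the paper's).

    full_out[h][s] / stream_out[h][s]: scalar output of head h on sample s
    under a full cache / a constant-length cache. Each head gets a gate g_h;
    the mixed output is sum_h (g_h * full + (1 - g_h) * stream). We minimize
    the mean squared error against the all-full output plus lam * sum(g_h).
    """
    num_heads = len(full_out)
    num_samples = len(full_out[0])
    diff = [[f - s for f, s in zip(full_out[h], stream_out[h])]
            for h in range(num_heads)]
    gates = [1.0] * num_heads  # start from full attention everywhere
    for _ in range(steps):
        # Per-sample residual of the gated mixture against the full target.
        residual = [sum((gates[h] - 1.0) * diff[h][s] for h in range(num_heads))
                    for s in range(num_samples)]
        for h in range(num_heads):
            grad = (2.0 / num_samples) * sum(
                residual[s] * diff[h][s] for s in range(num_samples)) + lam
            # Gradient step followed by projection back onto [0, 1].
            gates[h] = min(1.0, max(0.0, gates[h] - lr * grad))
    retrieval = [h for h, g in enumerate(gates) if g > threshold]
    return gates, retrieval


# Synthetic outputs: heads 0 and 2 change a lot without the full cache,
# heads 1 and 3 barely change or do not change at all.
full_out = [
    [2.0, 1.0, -2.0, 1.0],
    [0.5, 0.4, 0.5, 0.4],
    [1.0, 3.0, 1.0, -3.0],
    [0.3, 0.3, 0.3, 0.3],
]
stream_out = [
    [0.0, 1.0, 0.0, 1.0],
    [0.45, 0.45, 0.45, 0.45],
    [1.0, 0.0, 1.0, 0.0],
    [0.3, 0.3, 0.3, 0.3],
]
gates, retrieval = identify_retrieval_heads(full_out, stream_out)
print(retrieval)
```

In this sketch the penalty pushes every gate toward zero, and only heads whose output genuinely depends on the full cache resist it. The premise under review is that a procedure of this kind, run on synthetic data, picks the same heads that real long-context tasks need.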
What would settle it
Running the method on a new long-context task and observing that accuracy drops significantly when using the constant cache for the designated streaming heads would falsify the claim.
Original abstract
Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, referred to as Streaming Heads, primarily focus on recent tokens and attention sinks and do not require full attention. Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads, which reduces both the decoding and pre-filling memory and latency of LLMs without compromising their long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is available at https://github.com/mit-han-lab/duo-attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DuoAttention, which classifies attention heads in LLMs into retrieval heads (requiring full KV cache to preserve long-context capabilities) and streaming heads (approximable with constant-length KV cache focused on recent tokens and attention sinks). A lightweight optimization algorithm run on synthetic data identifies the retrieval heads. The method is claimed to reduce long-context inference memory by up to 2.55x (MHA) and 1.67x (GQA), speed up decoding by up to 2.18x and 1.50x, and accelerate pre-filling by up to 1.73x and 1.63x respectively, while incurring only minimal accuracy loss versus full attention. Combined with quantization, it enables 3.3M-token context on Llama-3-8B using a single A100 GPU. Code is released.
Significance. If the synthetic-data head classification proves robust and generalizes, the work would meaningfully advance practical deployment of long-context LLMs by cutting KV-cache memory and latency without large accuracy penalties. The open-source code is a clear strength that supports reproducibility. The approach builds on existing observations about head specialization and attention sinks but its broader impact depends on whether the identified partition remains necessary and sufficient outside the reported evaluation settings.
major comments (2)
- [§3] §3 (Head Identification): The optimization procedure on synthetic data is used to select retrieval heads, yet the manuscript provides no direct ablation demonstrating necessity (e.g., accuracy drop when a selected retrieval head is forced to use constant-length cache) or sufficiency (e.g., that restricting all other heads to constant cache preserves performance on the long-context benchmarks). This partition is load-bearing for the central efficiency claims.
- [§4] §4 (Experiments): Performance numbers (memory reduction, speedups, accuracy) are reported only after head classification on synthetic data; there is no cross-task or cross-length hold-out validation showing the selected subset remains adequate for arbitrary long-context tasks, leaving open the possibility that the synthetic objective yields a convenient but incomplete partition.
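The necessity and sufficiency ablations requested in the major comments could be organized along the following lines. This is a hedged sketch only: `evaluate` is a hypothetical callable that takes the set of heads given a full KV cache and returns a benchmark score, and the toy scorer at the bottom is invented purely so the harness runs end to end.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def necessity_and_sufficiency(
    evaluate: Callable[[FrozenSet[int]], float],
    all_heads: Iterable[int],
    retrieval_heads: Iterable[int],
) -> Dict[str, object]:
    """Run the two ablations on a head partition.

    evaluate(full_cache_heads) -> score, where every head outside
    full_cache_heads is restricted to a constant-length cache.
    """
    all_set = frozenset(all_heads)
    retrieval = frozenset(retrieval_heads)
    baseline = evaluate(all_set)           # full attention everywhere
    partition_score = evaluate(retrieval)  # only retrieval heads keep full cache
    # Necessity: demote one selected head at a time and record the score drop.
    necessity_drop = {
        h: partition_score - evaluate(retrieval - {h}) for h in sorted(retrieval)
    }
    return {
        "baseline": baseline,
        # Sufficiency: gap between full attention and the proposed partition.
        "sufficiency_gap": baseline - partition_score,
        "necessity_drop": necessity_drop,
    }


# Toy scorer (invented for illustration): heads 0 and 2 truly matter, and
# each one that loses its full cache costs 0.3 accuracy.
TRULY_IMPORTANT = frozenset({0, 2})


def toy_evaluate(full_cache_heads: FrozenSet[int]) -> float:
    missing = len(TRULY_IMPORTANT - full_cache_heads)
    return 1.0 - 0.3 * missing


report = necessity_and_sufficiency(toy_evaluate, range(4), [0, 2])
print(report)
```

A small sufficiency gap together with a clear drop for every demoted head would support the partition; a large gap, or a near-zero drop for some selected head, would point to an incomplete or over-inclusive selection.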
minor comments (3)
- [Abstract] The abstract states 'minimal accuracy loss' without quantifying the exact delta or the specific long-context tasks/metrics used for this assessment.
- [Figures] Figure captions and legends for attention-pattern visualizations could be expanded to clarify how retrieval versus streaming heads are highlighted.
- [§3] Notation for the constant-length cache size hyperparameter and its relation to attention-sink handling is introduced without a dedicated equation or pseudocode block.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the validation needs for the head classification in DuoAttention. We address each major comment below and will revise the manuscript accordingly to include the suggested ablations and cross-validation experiments.
- Referee: [§3] §3 (Head Identification): The optimization procedure on synthetic data is used to select retrieval heads, yet the manuscript provides no direct ablation demonstrating necessity (e.g., accuracy drop when a selected retrieval head is forced to use constant-length cache) or sufficiency (e.g., that restricting all other heads to constant cache preserves performance on the long-context benchmarks). This partition is load-bearing for the central efficiency claims.
  Authors: We agree that direct ablations on necessity and sufficiency would provide stronger support for the retrieval head partition. In the revised manuscript, we will add experiments that force selected retrieval heads to use constant-length KV cache and measure the resulting accuracy drop on long-context benchmarks. We will also report results when all streaming heads are restricted to constant-length cache while retrieval heads retain full KV cache, confirming that performance is preserved. These ablations will follow the same synthetic data identification and evaluation protocol as the original results.
  revision: yes
- Referee: [§4] §4 (Experiments): Performance numbers (memory reduction, speedups, accuracy) are reported only after head classification on synthetic data; there is no cross-task or cross-length hold-out validation showing the selected subset remains adequate for arbitrary long-context tasks, leaving open the possibility that the synthetic objective yields a convenient but incomplete partition.
  Authors: We acknowledge the need to demonstrate generalization of the identified heads. In the revision, we will include additional experiments applying the synthetic-data-selected retrieval heads to hold-out long-context tasks and context lengths not used in the optimization. We will report accuracy, memory savings, and latency improvements on these settings to show that the partition remains effective and is not limited to the synthetic objective.
  revision: yes
Circularity Check
No significant circularity in DuoAttention derivation chain
full rationale
The paper's core derivation identifies retrieval heads via a lightweight optimization procedure run on synthetic data, then applies full KV cache only to that subset while restricting streaming heads to constant-length cache; long-context accuracy and efficiency metrics are measured on separate benchmark tasks after identification. This separation means the reported performance numbers do not reduce to quantities fitted on the evaluation data itself. No equations, self-citations, or uniqueness theorems are invoked that would make the partition or the efficiency gains equivalent to the inputs by construction. The approach remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Attention heads in transformer LLMs can be partitioned into retrieval heads that require full long-range context and streaming heads that do not.
invented entities (2)
- Retrieval Heads: no independent evidence
- Streaming Heads: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately.
- IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our method significantly reduces long-context inference memory by up to 2.55× for MHA and 1.67× for GQA models while speeding up decoding by up to 2.18× and 1.50×
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
  KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
- InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
  InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
- AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
  AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
- Compute Where it Counts: Self Optimizing Language Models
  SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
- Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
  LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
- The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
  Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...
- Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
  SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...
- Training Transformers for KV Cache Compressibility
  KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
- Training Transformers for KV Cache Compressibility
  Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
- CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
  CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
- RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
  RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
- BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
  BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
  A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
  NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
  Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...
- TIDE: Every Layer Knows the Token Beneath the Context
  TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
  HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
- Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
  Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.
- The Pitfalls of KV Cache Compression
  KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...