Recognition: no theorem link
Fast Transformer Decoding: One Write-Head is All You Need
Pith reviewed 2026-05-10 23:45 UTC · model grok-4.3
The pith
Multi-query attention shares keys and values across heads to speed up Transformer decoding with little quality loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that tying the key and value projections together across all attention heads produces a multi-query attention layer whose key-value cache is far smaller than in standard multi-head attention. Because incremental decoding must repeatedly read this cache, the reduced size lowers memory bandwidth and therefore raises generation speed. The resulting models retain most of their original quality on the tasks and sizes tested.
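As a rough back-of-envelope illustration of the bandwidth argument (the dimensions below are made up for the example, not taken from the paper):

```python
# Per-step key-value cache read during incremental decoding, in bytes.
# Multi-query attention keeps one shared K/V head instead of n_heads of them,
# so the cache that must be re-read at every step shrinks by ~n_heads.
def kv_cache_bytes(seq_len, n_heads, d_head, bytes_per_elem=2, multi_query=False):
    kv_heads = 1 if multi_query else n_heads
    return 2 * seq_len * kv_heads * d_head * bytes_per_elem  # keys + values

mha = kv_cache_bytes(seq_len=1024, n_heads=8, d_head=64)                    # 2.0 MiB
mqa = kv_cache_bytes(seq_len=1024, n_heads=8, d_head=64, multi_query=True)  # 0.25 MiB
print(mha / mqa)  # 8.0: reduction factor equals the number of heads
```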
What carries the argument
Multi-query attention, the variant in which a single key projection and a single value projection serve all query heads instead of separate projections per head.
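A minimal sketch of one incremental decoding step with this sharing, in NumPy; the shapes and the step interface are illustrative assumptions, not the authors' reference code:

```python
import numpy as np

def mqa_step(x, W_q, W_k, W_v, W_o, k_cache, v_cache):
    # x:                [d_model]            hidden state of the current token
    # W_q:              [h, d_model, d_head] per-head query projections
    # W_k, W_v:         [d_model, d_head]    single shared key/value projections
    # W_o:              [h * d_head, d_model]
    # k_cache, v_cache: [t, d_head]          one shared cache, not one per head
    q = np.einsum('hmd,m->hd', W_q, x)                 # [h, d_head]
    k_cache = np.vstack([k_cache, x @ W_k])            # [t+1, d_head]
    v_cache = np.vstack([v_cache, x @ W_v])            # [t+1, d_head]
    logits = q @ k_cache.T / np.sqrt(q.shape[-1])      # [h, t+1]
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # softmax over positions
    heads = weights @ v_cache                          # [h, d_head]
    y = heads.reshape(-1) @ W_o                        # [d_model]
    return y, k_cache, v_cache
```

The only departure from standard multi-head attention is that W_k, W_v, and the two caches have no head axis; the queries and the output projection are untouched.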
If this is right
- Decoding speed rises because the key-value cache read at each step is much smaller.
- The same model size now fits within tighter memory-bandwidth limits.
- Quality remains close to the multi-head baseline across the tested model scales and tasks.
- Training throughput is essentially unaffected, because the memory-bandwidth bottleneck the change removes arises only during incremental decoding; training remains parallel across the sequence length.
Where Pith is reading between the lines
- The same sharing idea could be applied to any autoregressive sequence model that uses multi-head attention.
- It opens the possibility of running larger models in real time on existing hardware.
- Future variants might use a small number of shared key-value groups rather than a single group to trade speed for capacity.
Load-bearing premise
Sharing one key-value pair across all heads still leaves the model enough capacity to learn the distinct attention patterns it needs.
What would settle it
Measure wall-clock decoding latency and task accuracy on a held-out benchmark after training an otherwise identical Transformer once with full multi-head attention and once with multi-query attention.
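A sketch of the measurement half of that protocol; `decode_fn` and `score_fn` stand in for whatever decoder and quality metric the experimenter supplies, and are not part of any particular library:

```python
import time

def measure(decode_fn, score_fn, eval_set):
    """Wall-clock decoding latency and task quality for one trained model."""
    start = time.perf_counter()
    outputs = [decode_fn(src) for src, _ in eval_set]
    sec_per_example = (time.perf_counter() - start) / len(eval_set)
    quality = score_fn(outputs, [ref for _, ref in eval_set])
    return sec_per_example, quality

# Run once per otherwise-identical model, e.g.:
#   mha_latency, mha_bleu = measure(mha_model.decode, bleu, wmt_dev)
#   mqa_latency, mqa_bleu = measure(mqa_model.decode, bleu, wmt_dev)
```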
read the original abstract
Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes multi-query attention, a variant of the standard multi-head attention used in Transformers. In this design, the key and value projections are shared across all heads (reducing the KV cache size by a factor equal to the number of heads) while queries remain head-specific. The central claim is that this change substantially reduces memory-bandwidth costs during incremental/autoregressive decoding, yielding much faster inference with only minor quality degradation relative to full multi-head attention. The authors support the claim with experiments on WMT translation and language-modeling benchmarks using models up to a few hundred million parameters.
Significance. If the reported speed/quality trade-off holds under broader conditions, the result is practically significant for efficient deployment of large Transformers. The modification is minimal, requires no changes to the training algorithm, and directly attacks the KV-cache bandwidth bottleneck that dominates incremental decoding. By providing concrete measurements of both latency and task metrics on standard benchmarks, the work offers a falsifiable, immediately usable technique that scales with model size and sequence length.
minor comments (3)
- [Abstract] Abstract: the statement that models 'can indeed be much faster to decode' and incur 'only minor quality degradation' is not quantified. Adding one sentence with concrete factors (e.g., '2-4x faster decoding with <0.5 BLEU drop on WMT En-De') would make the high-level claim self-contained.
- [§3] §3 (Method): the dimension reduction for the shared K and V tensors is described in prose but would benefit from an explicit equation or diagram showing the new shapes relative to standard multi-head attention (e.g., a shared K, V ∈ ℝ^{T×d_head} replacing the per-head K, V ∈ ℝ^{T×h×d_head}); one possible rendering is sketched after this list.
- [§4] §4 (Experiments): while the speed/quality results are reported, the text should explicitly state whether all compared models were trained with identical hyper-parameters and total parameter budgets, and whether statistical significance or multiple random seeds were used for the quality metrics.
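One way the shape comparison requested in the §3 comment could be written, using h heads, head width d_head, and sequence length T (a sketch, not text from the manuscript):

```latex
\text{multi-head: } K, V \in \mathbb{R}^{T \times h \times d_{\mathrm{head}}}
\qquad\longrightarrow\qquad
\text{multi-query: } K, V \in \mathbb{R}^{T \times d_{\mathrm{head}}},
\qquad Q \in \mathbb{R}^{T \times h \times d_{\mathrm{head}}} \text{ in both cases.}
```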
Simulated Author's Rebuttal
We thank the referee for the supportive summary and recommendation of minor revision. The report contains no specific major comments to address point-by-point. We remain available to incorporate any editorial or minor clarifications the editor may request in a revised version.
Circularity Check
No significant circularity
full rationale
The paper proposes multi-query attention as a direct architectural modification to standard multi-head attention by sharing keys and values across heads. This change is introduced explicitly to address memory bandwidth in incremental decoding and is validated through separate experiments on WMT translation and language-modeling tasks. No derivation chain exists that reduces a claimed result to its own inputs via self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on the proposal itself plus external empirical measurements rather than any closed logical loop or ansatz smuggled through prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: standard scaled dot-product attention and multi-head concatenation formulas from the original Transformer
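For reference, the cited formulas in the notation of the original Transformer paper:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\quad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\right).
```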
invented entities (1)
- multi-query attention (no independent evidence)
Forward citations
Cited by 54 Pith papers
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.
-
Fast Cross-Operator Optimization of Attention Dataflow
MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.
-
Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation
SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Accelerating Large Language Model Decoding with Speculative Sampling
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Nectar: Neural Estimation of Cached-Token Attention via Regression
Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
-
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
-
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
CuTile delivers high performance on select AI workloads and GPUs but varies significantly by architecture and is less portable than Triton across tested platforms.
-
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...
-
Graph-Guided Adaptive Channel Elimination for KV Cache Compression
GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
-
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
-
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
-
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while re...
-
TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference
TRAPTI delivers cycle-accurate memory occupancy traces to guide SRAM banking and power-gating choices, showing a 2.72x lower peak memory footprint for a GQA model versus MHA under identical accelerator settings.
-
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
CSAttention precomputes fixed-size query-centric lookup tables in offline prefill to enable fast table-lookup decoding, delivering near-identical accuracy to full attention and up to 4.6x speedup at 95% sparsity for 3...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
-
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
-
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.
-
Retentive Network: A Successor to Transformer for Large Language Models
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
-
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
EdgeFM: Efficient Edge Inference for Vision-Language Models
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
-
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
-
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
StarCoder: may the source be with you!
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
-
DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding
DARC-CLIP improves CLIP-based meme classification with hierarchical adaptive refinement, delivering +4.18 AUROC and +6.84 F1 gains in hate detection on PrideMM and CrisisHateMM benchmarks.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
Reference graph
Works this paper leans on
- [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2014.
- [2] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/abs/1312.3005.
- [3] Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. In Proceedings of the International Conference on Learning Representations, 2018.
- [4] Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur. A time-restricted self-attention layer for ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
- [5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- [6] Biao Zhang, Deyi Xiong, and Jinsong Su. Accelerating neural transformer via an average attention network, 2018.
discussion (0)