pith. machine review for the scientific record.

arxiv: 2305.13245 · v3 · submitted 2023-05-22 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 06:48 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords grouped-query attention · multi-query attention · uptraining · transformer · inference optimization · language models · attention mechanisms · model adaptation

The pith

Uptraining multi-head attention checkpoints to grouped-query attention recovers near-original quality with only 5% additional compute and achieves multi-query inference speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors show how to adapt existing multi-head transformer language models to use grouped-query attention without starting over. They introduce GQA as an intermediate form between full multi-head attention and multi-query attention, where groups of query heads share key-value heads. A short uptraining phase costing 5% of the original pre-training compute suffices to bring the quality back close to the original model. This yields models that run inference as fast as multi-query attention while keeping most of the accuracy of the slower multi-head versions. The approach lets practitioners reuse valuable checkpoints rather than training new models from scratch for faster serving.

Core claim

Existing multi-head attention language model checkpoints can be uptrained into grouped-query attention (GQA) models using only 5% of the original pre-training compute. GQA generalizes multi-query attention by using more than one but fewer than the full number of key-value heads, with multiple query heads grouped to share each key-value head. The uptrained GQA models achieve quality close to the original multi-head attention models while providing inference speeds comparable to multi-query attention.
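
The conversion step itself is simple: per the paper, each grouped key-value head is constructed by mean-pooling the projection matrices of the original heads assigned to that group, after which the model is uptrained briefly. A minimal NumPy sketch of that pooling step, with illustrative array names and shapes rather than the paper's actual code:

```python
import numpy as np

def pool_kv_heads(k_proj, v_proj, num_kv_heads):
    """Mean-pool per-head key/value projections of an MHA checkpoint into GQA groups.

    k_proj, v_proj: (num_heads, d_model, d_head) stacked per-head projection weights.
    num_kv_heads:   target number of key-value heads (1 -> MQA, num_heads -> MHA).
    """
    num_heads = k_proj.shape[0]
    assert num_heads % num_kv_heads == 0, "query heads must split evenly into groups"
    group = num_heads // num_kv_heads

    # Each new KV head is the mean of the original heads in its group.
    k_gqa = k_proj.reshape(num_kv_heads, group, *k_proj.shape[1:]).mean(axis=1)
    v_gqa = v_proj.reshape(num_kv_heads, group, *v_proj.shape[1:]).mean(axis=1)
    return k_gqa, v_gqa  # (num_kv_heads, d_model, d_head)
```

The pooled checkpoint is then trained for a small additional budget (the 5% of original pre-training compute in the claim) so the shared key-value heads can adapt.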

What carries the argument

Grouped-query attention (GQA), in which query heads are partitioned into groups that share the same key and value heads, serves as the central mechanism for balancing model capacity and inference efficiency during uptraining.
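
To make the grouping concrete, a toy single-token decode step is sketched below (NumPy, illustrative shapes only, not the paper's implementation). With H query heads and G key-value heads, each block of H/G query heads attends over the same cached keys and values, so the KV cache shrinks by a factor of H/G relative to multi-head attention; G = 1 recovers MQA and G = H recovers MHA.

```python
import numpy as np

def gqa_decode_step(q, k_cache, v_cache):
    """One decode step of grouped-query attention.

    q:        (num_q_heads, d_head)            queries for the new position
    k_cache:  (num_kv_heads, seq_len, d_head)  cached keys, one set per KV head
    v_cache:  (num_kv_heads, seq_len, d_head)  cached values, one set per KV head
    """
    num_q_heads, d_head = q.shape
    num_kv_heads = k_cache.shape[0]
    group = num_q_heads // num_kv_heads  # query heads sharing each KV head

    out = np.empty_like(q)
    for h in range(num_q_heads):
        g = h // group                                  # shared KV head for this query head
        scores = k_cache[g] @ q[h] / np.sqrt(d_head)    # (seq_len,)
        w = np.exp(scores - scores.max())
        w /= w.sum()                                    # softmax over cached positions
        out[h] = w @ v_cache[g]                         # (d_head,)
    return out                                          # (num_q_heads, d_head)
```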

If this is right

  • Uptrained GQA models can be deployed for inference at speeds similar to MQA without retraining from scratch.
  • The 5% compute uptraining makes converting large models practical and cost-effective.
  • GQA allows choosing the number of key-value heads as a tunable trade-off parameter between quality and speed.
  • Practitioners can leverage existing multi-head checkpoints for faster models instead of training dedicated inference-optimized versions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar uptraining recipes might extend to other attention modifications or model families beyond the tested transformers.
  • The grouping in GQA could be made layer-specific to optimize quality-speed tradeoffs further.
  • This method reduces barriers to experimenting with faster attention variants on pre-trained models.

Load-bearing premise

The 5% compute uptraining recipe is enough to restore quality close to the original multi-head model without hidden failures on particular tasks or model sizes.

What would settle it

If an uptrained GQA model shows substantially lower performance than the original multi-head model on standard language modeling benchmarks or downstream tasks, or if inference speed gains are not realized in practice, the central claim would be falsified.

read the original abstract

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces grouped-query attention (GQA) as an intermediate attention mechanism between multi-head attention (MHA) and multi-query attention (MQA), along with a recipe to uptrain existing MHA language model checkpoints into GQA (or MQA) models using only 5% of the original pre-training compute. The central empirical claim is that the resulting uptrained GQA models recover quality close to the original MHA while delivering inference speed comparable to MQA.

Significance. If the empirical claims hold, the work is significant for efficient deployment of large language models: it offers a low-cost way to convert high-quality MHA checkpoints into faster-inference variants without full retraining, and GQA provides a tunable point on the quality-speed tradeoff that was previously missing between MHA and single-head MQA.

major comments (2)
  1. [Abstract] Abstract: The load-bearing claim that 'uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA' is not supported by any quantitative speed or latency numbers, nor by the specific GQA configuration (number of KV heads) used to achieve the reported quality. Because KV-cache size and memory-bandwidth cost scale linearly with the number of KV heads, any GQA variant that closes most of the quality gap to MHA necessarily has a larger cache than single-head MQA and cannot be assumed to deliver comparable speed in the memory-bound regime without explicit measurements.
  2. [Results] Results section: The manuscript must include tables or figures that jointly report quality metrics and inference throughput/latency for the exact GQA configurations (e.g., 4 or 8 KV heads) that are claimed to be 'close' to MHA quality, together with the corresponding MHA and MQA baselines. Without these paired measurements it is impossible to verify whether the speed-quality tradeoff asserted in the abstract is actually realized.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from an explicit statement of the number of KV heads used in the GQA experiments that support the main claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for clearer quantitative support of the speed-quality claims. We will revise the manuscript to address both points by adding specific details and paired measurements.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The load-bearing claim that 'uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA' is not supported by any quantitative speed or latency numbers, nor by the specific GQA configuration (number of KV heads) used to achieve the reported quality. Because KV-cache size and memory-bandwidth cost scale linearly with the number of KV heads, any GQA variant that closes most of the quality gap to MHA necessarily has a larger cache than single-head MQA and cannot be assumed to deliver comparable speed in the memory-bound regime without explicit measurements.

    Authors: We agree the abstract would benefit from greater specificity. The body of the paper specifies the GQA configurations (e.g., 8 KV heads for 32-query-head models) and reports quality recovery in the results tables. Inference speed is analyzed via KV-cache size reduction in the memory-bound regime. We will revise the abstract to name the KV-head count used for the quality claims, reference the speed analysis, and clarify that GQA delivers speeds between MHA and MQA (closer to MQA as the number of groups increases). revision: yes

  2. Referee: [Results] Results section: The manuscript must include tables or figures that jointly report quality metrics and inference throughput/latency for the exact GQA configurations (e.g., 4 or 8 KV heads) that are claimed to be 'close' to MHA quality, together with the corresponding MHA and MQA baselines. Without these paired measurements it is impossible to verify whether the speed-quality tradeoff asserted in the abstract is actually realized.

    Authors: We acknowledge the value of paired reporting. The current results present quality metrics for GQA variants with different KV-head counts alongside a separate analysis of inference cost based on KV-cache memory bandwidth. We will add a new table or figure in the revised results section that jointly shows quality metrics and relative inference throughput (estimated from KV-cache size, with measured values where available) for MHA, GQA-8, GQA-4, and MQA baselines. revision: yes
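
The KV-cache arithmetic behind these exchanges is straightforward: cache size, and hence the memory traffic that dominates autoregressive decode, grows linearly with the number of key-value heads. A back-of-the-envelope sketch with hypothetical model dimensions (not figures from the paper) shows how the relative estimates in the proposed table could be derived:

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Keys and values for every layer, KV head, cached position, and batch element.
    return 2 * layers * kv_heads * d_head * seq_len * batch * bytes_per_elem

# Illustrative configuration (hypothetical): 64 query heads, 48 layers, d_head = 64.
cfg = dict(layers=48, d_head=64, seq_len=2048, batch=32, bytes_per_elem=2)
mha = kv_cache_bytes(kv_heads=64, **cfg)
for name, kv in [("MHA", 64), ("GQA-8", 8), ("MQA", 1)]:
    size = kv_cache_bytes(kv_heads=kv, **cfg)
    print(f"{name:6s} {kv:2d} KV heads: {size / 2**30:5.2f} GiB cache, "
          f"{mha / size:4.1f}x reduction vs. the MHA baseline")
```

In the memory-bound decode regime, the time spent loading this cache per step roughly tracks its size, which is why cutting KV heads from the full count down to 8 or 1 is the lever behind the claimed speedups; as the referee notes, realized throughput still needs direct measurement.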

Circularity Check

0 steps flagged

No circularity: empirical recipe validated by direct experiments

full rationale

The paper proposes an uptraining procedure to convert multi-head attention checkpoints into grouped-query attention models and reports empirical quality and speed measurements. No derivation chain, first-principles equations, or predictions are present that could reduce to the inputs by construction. All load-bearing claims rest on experimental comparisons (quality metrics and inference throughput) rather than self-definitional quantities, fitted parameters renamed as predictions, or self-citation chains. The central result is therefore checked against external benchmarks rather than being self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; no equations or experimental details are provided to audit.

pith-pipeline@v0.9.0 · 5437 in / 1031 out tokens · 67821 ms · 2026-05-11T06:48:00.303359+00:00 · methodology

discussion (0)


Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

    cs.DC 2026-05 unverdicted novelty 7.0

    Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

  2. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

    cs.DC 2026-05 unverdicted novelty 7.0

    Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

  3. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  4. Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

    cs.PF 2026-04 unverdicted novelty 7.0

    RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.

  5. A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

    cs.AR 2026-04 conditional novelty 7.0

    ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

  6. MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

    cs.LG 2026-04 unverdicted novelty 7.0

    MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...

  7. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  8. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  9. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  10. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  11. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  12. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  13. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  14. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  15. Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

    cs.DC 2026-05 unverdicted novelty 6.0

    Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.

  16. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  17. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  18. QERNEL: a Scalable Large Electron Model

    cond-mat.str-el 2026-04 unverdicted novelty 6.0

    QERNEL is a single conditioned neural wavefunction that variationally solves families of many-electron Hamiltonians in moiré heterobilayers and identifies the quantum liquid-crystal phase transition.

  19. SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

    cs.NI 2026-04 unverdicted novelty 6.0

    SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...

  20. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  21. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  22. Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

    cs.LG 2026-04 unverdicted novelty 6.0

    Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

  23. Nucleus-Image: Sparse MoE for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  24. Quantization Dominates Rank Reduction for KV-Cache Compression

    cs.LG 2026-04 conditional novelty 6.0

    Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...

  25. IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.

  26. WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning

    cs.PF 2026-04 unverdicted novelty 6.0

    WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with dra...

  27. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  28. DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

    cs.AR 2026-04 conditional novelty 6.0

    DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...

  29. EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

    cs.CL 2026-03 unverdicted novelty 6.0

    EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand f...

  30. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  31. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  32. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  33. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  34. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    cs.LG 2023-07 accept novelty 6.0

    FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.

  35. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  36. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  37. ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

    cs.DC 2026-04 unverdicted novelty 5.0

    ClusterFusion++ fuses the entire Transformer block (LayerNorm to residual) via CUDA extensions and achieves 1.34x throughput on Pythia-2.8B with near-identical output fidelity.

  38. When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

    cs.LG 2026-04 unverdicted novelty 5.0

    DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

  39. Sapiens2

    cs.CV 2026-04 unverdicted novelty 5.0

    Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...

  40. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  41. VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

    cs.LG 2026-04 unverdicted novelty 5.0

    VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.

  42. Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    SentryFuse delivers modality-aware zero-shot pruning and sparse attention that improves accuracy by 12.7% on average and up to 18% under sensor dropout while cutting memory 28.2% and latency up to 1.63x across multimo...

  43. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  44. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  45. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  46. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  47. Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference

    cs.CR 2026-04 unverdicted novelty 4.0

    A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

  48. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  49. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  50. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  51. Comparative Characterization of KV Cache Management Strategies for LLM Inference

    cs.AR 2026-04 unverdicted novelty 3.0

    Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.

  52. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  53. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

  54. EXAONE 4.5 Technical Report

    cs.CL 2026-04 unverdicted novelty 2.0

    EXAONE 4.5 is a new open-weight multimodal model that matches general benchmarks and outperforms similar-scale models on document understanding and Korean contextual reasoning.

  55. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
