Training Compute-Optimal Large Language Models

Aidan Clark; Arthur Mensch; Aurelia Guy; Bogdan Damoc; Diego de las Casas; Elena Buchatskaya; Eliza Rutherford; Erich Elsen; Eric Noland; George van den Driessche

arxiv: 2203.15556 · v1 · submitted 2022-03-29 · 💻 cs.CL · cs.LG


Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre


Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3


keywords compute optimal training · scaling laws · large language models · transformer · Chinchilla · model size · training tokens


The pith

For compute-optimal LLM training, scale model size and training tokens equally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how best to split a fixed compute budget between the number of model parameters and the amount of training data for transformer language models. Through experiments with over 400 models of varying sizes and data volumes, it concludes that current large models are undertrained because training data was held roughly fixed while parameter counts grew. The central result is that model size and training tokens should scale in tandem: every doubling of model size should be matched by a doubling of training tokens, so each grows roughly as the square root of compute. This approach yields models that outperform much larger ones on standard benchmarks while using less compute for inference.

Core claim

By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.

What carries the argument

The scaling relation in which the compute-optimal model size N and token count D grow in proportion to each other as the compute budget increases, derived by fitting loss as a function of N and D.
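A minimal sketch of what equal scaling implies, assuming the common approximation C ≈ 6ND for training FLOPs and a fixed tokens-per-parameter ratio. The function name `allocate` and the default ratio of 20 (roughly Chinchilla's ~1.4T tokens over 70B parameters, as quoted further down this page) are illustrative assumptions, not values prescribed by the paper.

```python
import math

def allocate(compute_flops: float, tokens_per_param: float = 20.0) -> tuple:
    """Split a FLOP budget into (params, tokens), assuming C = 6*N*D and D = r*N."""
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

base_n, base_d = allocate(1e21)
quad_n, quad_d = allocate(4e21)

# Under equal scaling both quantities grow as sqrt(C):
# four times the compute doubles both the parameters and the tokens.
print(quad_n / base_n, quad_d / base_d)
```

The ratio argument is left as a parameter because the sketch only illustrates the square-root growth; the appropriate ratio for a given setup is an empirical question.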

If this is right

With 70B parameters, Chinchilla achieves higher accuracy on benchmarks such as MMLU than larger models, including Gopher, which was trained with the same compute.
Smaller optimal models reduce the compute needed for fine-tuning and inference.
Future training runs should increase data proportionally to model size rather than fixing data size.
Undertrained models can be improved by adding more data instead of just more parameters.
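The inference saving in the list above can be made concrete with a rough sketch. It assumes the usual rule of thumb of about 2 FLOPs per parameter per token for a dense transformer forward pass (ignoring the cost of attention over the context); the helper name is hypothetical.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rough forward-pass cost for a dense transformer: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params

gopher = forward_flops_per_token(280e9)      # Gopher, 280B parameters
chinchilla = forward_flops_per_token(70e9)   # Chinchilla, 70B parameters

# The ratio depends only on the parameter counts, so the constant 2 cancels.
ratio = gopher / chinchilla
print(f"Gopher / Chinchilla per-token cost: {ratio:.1f}x")
```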

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data collection efforts will need to grow in tandem with model scaling to maintain optimal performance.
Similar optimal scaling ratios may apply to other domains like vision or multimodal models.
This challenges the prior trend of ever-larger models trained on fixed amounts of data.

Load-bearing premise

The parametric form of the scaling law fitted to models up to 16B parameters and 500B tokens holds for larger scales.

What would settle it

Training a model at a larger compute budget using the equal-scaling prediction and finding that its loss or downstream performance is worse than a model using a different N-to-D ratio.

Original abstract

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that LLMs have been undertrained on data and that equal scaling of parameters and tokens is compute-optimal, backed by a large sweep and a direct validation run.


The main takeaway is that current large models like Gopher are undertrained, and that for a fixed compute budget you get better results by scaling model size and training tokens together rather than mostly growing the model. They trained over 400 models from 70M to 16B parameters on 5B to 500B tokens, fitted a scaling law, and then trained Chinchilla at 70B parameters with roughly 1.4T tokens under the same compute as Gopher. Chinchilla beats Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a wide range of downstream tasks, including a clear gain on MMLU. That out-of-sample validation run is the strongest part of the work and makes the equal-scaling claim credible rather than just a curve fit. The study is useful because it gives a concrete, actionable rule for how to allocate compute when training transformers at this scale. The soft spot is the extrapolation: the law is fitted on models up to 16B and 500B tokens, so moving to 70B and 1.4T tokens assumes the same functional form continues to hold. The successful Chinchilla run reduces the risk, but it does not eliminate uncertainty about whether a different loss curve would shift the optimum. This is the kind of paper that changes how labs set training budgets, so it deserves a serious referee even if some details of the functional form need tightening in revision.

Referee Report

1 major / 2 minor

Summary. The paper investigates the optimal allocation of compute between model size (N) and training tokens (D) for transformer language models. By training over 400 models spanning 70M to 16B parameters and 5B to 500B tokens, the authors fit a parametric scaling law for validation loss and conclude that compute-optimal training requires scaling N and D equally. They validate the prediction by training Chinchilla (70B parameters, ~1.4T tokens) under the same compute budget as Gopher (280B parameters, 300B tokens); Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a wide range of downstream tasks, including achieving 67.5% average accuracy on MMLU.
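The "same compute budget" statement in the summary can be checked with back-of-the-envelope arithmetic. This sketch assumes the approximation C ≈ 6ND for training FLOPs and uses the rounded parameter and token counts quoted above; the function name is hypothetical.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

gopher_c = training_flops(280e9, 300e9)       # 280B parameters, 300B tokens
chinchilla_c = training_flops(70e9, 1.4e12)   # 70B parameters, ~1.4T tokens

print(f"Gopher:     {gopher_c:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_c:.2e} FLOPs")
print(f"ratio:      {chinchilla_c / gopher_c:.2f}")
```

With these rounded inputs the two budgets land within roughly 17% of each other, which is about as close as the approximation and the rounded token counts allow.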

Significance. If the derived scaling laws and equal N/D scaling hold, the work is highly significant: it provides empirical evidence that many recent large models are undertrained and demonstrates a practical method for more efficient compute allocation that reduces inference and fine-tuning costs. The strength lies in the scale of the experimental sweep (>400 models) combined with direct out-of-sample validation via the Chinchilla training run, which moves the claim beyond pure curve-fitting.

major comments (1)

[Scaling Laws section (around the derivation of optimal N and D)] The scaling-law fit is performed on models up to 16B parameters; the Chinchilla prediction extrapolates both in N (to 70B) and in D (to 1.4T tokens). While the successful Chinchilla run provides supporting evidence, the manuscript should quantify the uncertainty in the predicted optimum arising from variance in the fitted coefficients (A, B, α, β) and discuss whether alternative functional forms for L(N,D) would materially change the equal-scaling conclusion.

minor comments (2)

[Abstract] Abstract contains a typographical error: '4× more more data' should be '4× more data'.
[Figures 2–5 and associated text] Several figures (e.g., loss-vs-compute curves and downstream-task comparisons) would benefit from explicit error bars or shaded uncertainty regions to convey variability across the 400-model sweep.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation of the work and for the constructive suggestion regarding uncertainty quantification. We address the major comment below and will incorporate the requested analysis in the revised manuscript.


Referee: [Scaling Laws section (around the derivation of optimal N and D)] The scaling-law fit is performed on models up to 16B parameters; the Chinchilla prediction extrapolates both in N (to 70B) and in D (to 1.4T tokens). While the successful Chinchilla run provides supporting evidence, the manuscript should quantify the uncertainty in the predicted optimum arising from variance in the fitted coefficients (A, B, α, β) and discuss whether alternative functional forms for L(N,D) would materially change the equal-scaling conclusion.

Authors: We agree that an explicit quantification of uncertainty in the extrapolated optimum would strengthen the presentation. In the revised manuscript we will add a short subsection to the Scaling Laws section that reports bootstrap confidence intervals on the fitted coefficients A, B, α, and β (obtained by resampling the >400 training runs with replacement). These intervals will be propagated through the closed-form expression for the optimal N*(C) and D*(C) to give a range of plausible optima at the compute budget used for Chinchilla. Regarding alternative functional forms, we will include a brief discussion showing that the equal-scaling conclusion is robust: the optimum arises from balancing the two power-law terms, so modest changes to the exponents or the use of a multiplicative interaction term leave the scaling exponents for N and D with respect to compute essentially unchanged (both remain close to 0.5). The successful Chinchilla training run, which lies well outside the fitted regime, already provides direct empirical support for the predicted allocation.

revision: yes
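A sketch of the propagation step described in the response, assuming the closed form that follows from minimizing L(N, D) = E + A/N^α + B/D^β under C ≈ 6ND, where N* ∝ C^a with a = β/(α+β) and D* ∝ C^b with b = α/(α+β). The bootstrap samples below are synthetic stand-ins drawn around placeholder exponents (not the paper's fitted values), since the actual resampled fits are not available here; all function names are hypothetical.

```python
import random

def compute_exponents(alpha: float, beta: float) -> tuple:
    """Exponents (a, b) such that N* ~ C**a and D* ~ C**b."""
    total = alpha + beta
    return beta / total, alpha / total

def percentile_interval(values, lo=0.025, hi=0.975):
    """Simple percentile interval over a list of samples."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lo * last)], ordered[int(hi * last)]

rng = random.Random(0)
# Synthetic stand-in for bootstrap refits: placeholder exponents with small noise.
samples = [(rng.gauss(0.30, 0.02), rng.gauss(0.30, 0.02)) for _ in range(2000)]
a_values = [compute_exponents(al, be)[0] for al, be in samples]

low, high = percentile_interval(a_values)
print(f"95% interval for a: [{low:.3f}, {high:.3f}]")
```

In a real analysis each sample would come from refitting the loss model to a resampled set of training runs; only the final propagation through `compute_exponents` would stay the same.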

Circularity Check

0 steps flagged

No significant circularity; out-of-sample validation confirms scaling prediction


The derivation fits a parametric loss function L(N, D) to empirical results from over 400 models (70M–16B parameters, 5B–500B tokens), derives the compute-optimal relation by minimizing under the constraint C ≈ 6ND, and directly tests the resulting prediction by training Chinchilla (70B parameters, ~1.4T tokens) under Gopher's compute budget. Chinchilla's superior downstream performance constitutes an independent test outside the fitting set, one the prediction could have failed. No step reduces to self-definition, fitted-input renaming, or load-bearing self-citation; the functional form and optimum are externally validated rather than tautological.
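The minimization step can be illustrated numerically. The sketch below assumes the loss form L(N, D) = E + A/N^α + B/D^β listed in the ledger further down, sweeps N along the constraint D = C/(6N), and compares the grid minimum with the closed-form optimum. The coefficients are placeholders chosen for illustration, not the paper's fitted values, so only the agreement between the two methods is meaningful.

```python
import math

# Placeholder coefficients for illustration only (not the paper's fitted values).
E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.3, 0.3

def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def numeric_optimum(compute: float, steps: int = 4000) -> float:
    """Grid-search N on a log scale along the constraint D = C / (6 N)."""
    best_n, best_loss = None, math.inf
    for i in range(steps + 1):
        n = 10 ** (6 + 8 * i / steps)  # N from 1e6 to 1e14
        current = loss(n, compute / (6.0 * n))
        if current < best_loss:
            best_n, best_loss = n, current
    return best_n

def closed_form_optimum(compute: float) -> float:
    """N* = G * (C/6)**a, G = (alpha*A / (beta*B))**(1/(alpha+beta)), a = beta/(alpha+beta)."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    return g * (compute / 6.0) ** (BETA / (ALPHA + BETA))

c = 1e21
print(numeric_optimum(c), closed_form_optimum(c))
```

The grid search is deliberately naive; it serves as an independent check on the algebra rather than as a fitting procedure.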

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The optimal allocation is obtained by fitting a parametric loss function L(N,D) to the 400-model results and minimizing under a compute constraint; the functional form and fitted coefficients are the main unverified inputs.

free parameters (1)

Scaling law coefficients
Parameters (A, B, alpha, beta, E) in the assumed loss scaling form L(N, D) = E + A/N^alpha + B/D^beta fitted to the empirical results of the 400 models.
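For reference, a short worked derivation of the compute-optimal allocation under this loss form, assuming the constraint C ≈ 6ND used elsewhere on this page. The symbols G, a, and b are introduced here for convenience.

```latex
\begin{align*}
L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND \\
\text{Substituting } D = \tfrac{C}{6N}:\quad
L(N) &= E + A N^{-\alpha} + B \left(\tfrac{C}{6}\right)^{-\beta} N^{\beta} \\
\frac{dL}{dN} = 0 \;\Rightarrow\;
\alpha A N^{-\alpha - 1} &= \beta B \left(\tfrac{C}{6}\right)^{-\beta} N^{\beta - 1} \\
N_{\mathrm{opt}}(C) &= G \left(\tfrac{C}{6}\right)^{a}, \qquad
D_{\mathrm{opt}}(C) = G^{-1} \left(\tfrac{C}{6}\right)^{b} \\
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}}, \qquad
a &= \frac{\beta}{\alpha + \beta}, \qquad
b = \frac{\alpha}{\alpha + \beta}
\end{align*}
```

Equal scaling (a ≈ b ≈ 0.5) therefore corresponds to α ≈ β, and the irreducible term E does not affect the location of the optimum.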

axioms (1)

domain assumption: Loss follows a power-law dependence on model size N and data D
The functional form used to fit data and derive the equal-scaling optimum.

pith-pipeline@v0.9.0 · 5617 in / 1398 out tokens · 67828 ms · 2026-05-10T15:55:30.872186+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization
cs.LG 2026-05 unverdicted novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets
econ.GN 2026-05 unverdicted novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching struc...
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
cs.LG 2026-04 unverdicted novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders
cs.IR 2024-03 unverdicted novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
cs.CL 2023-05 conditional novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Editing Models with Task Arithmetic
cs.LG 2022-12 accept novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Teaching Models to Express Their Uncertainty in Words
cs.CL 2022-05 unverdicted novelty 8.0

GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
nlin.AO 2026-05 unverdicted novelty 7.0

LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
Do Language Models Align with Brains? Prediction Scores Are Not Enough
q-bio.NC 2026-05 unverdicted novelty 7.0

Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
stat.ML 2026-05 unverdicted novelty 7.0

In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
Active Testing of Large Language Models via Approximate Neyman Allocation
cs.AI 2026-05 unverdicted novelty 7.0

Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings vers...
How Much is Brain Data Worth for Machine Learning?
cs.AI 2026-05 conditional novelty 7.0

Brain data is worth a variable number of task samples depending on task-brain alignment, noise levels, and latent dimension, with conditions under which it also improves robustness to test distribution shift.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
cond-mat.dis-nn 2026-05 unverdicted novelty 7.0

A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
On the Invariance and Generality of Neural Scaling Laws
cs.LG 2026-05 unverdicted novelty 7.0

Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence
cs.LG 2026-05 unverdicted novelty 7.0

Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
cs.DC 2026-05 unverdicted novelty 7.0

Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation
q-bio.GN 2026-04 unverdicted novelty 7.0

CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
cs.MA 2026-04 unverdicted novelty 7.0

Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
cs.IR 2026-04 unverdicted novelty 7.0

LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
cs.CL 2026-04 unverdicted novelty 7.0

Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Neural Garbage Collection: Learning to Forget while Learning to Reason
cs.LG 2026-04 conditional novelty 7.0

Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
Causal inference for social network formation
econ.EM 2026-04 conditional novelty 7.0

Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
cs.AI 2026-04 unverdicted novelty 7.0

A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
cs.AI 2026-04 unverdicted novelty 7.0

LLMs and VLMs encode viewpoint information in hidden states but fail to bind it to corresponding observations, resulting in hallucinations in final layers on text-only viewpoint rotation tasks.
STORM: End-to-End Referring Multi-Object Tracking in Videos
cs.CV 2026-04 unverdicted novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
The Shrinking Lifespan of LLMs in Science
cs.DL 2026-04 unverdicted novelty 7.0

LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
cs.DB 2026-04 unverdicted novelty 7.0

LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
cs.DC 2026-03 unverdicted novelty 7.0

GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
cs.CV 2026-03 unverdicted novelty 7.0

Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
Latent Generative Solvers for Generalizable Long-Term Physics Simulation
cs.AI 2026-02 unverdicted novelty 7.0

LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.
OmniMol: Transferring Particle Physics Knowledge to Molecular Dynamics with Point-Edge Transformers
physics.chem-ph 2026-01 unverdicted novelty 7.0

OmniMol transfers a billion-jet pre-trained PET foundation model from HEP to molecular dynamics via an interaction-matrix attention bias, delivering strong performance on the oMol dataset with minimal fine-tuning and ...
Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
The Art of Scaling Reinforcement Learning Compute for LLMs
cs.LG 2025-10 unverdicted novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...
Scaling Vision Transformers for Functional MRI with Flat Maps
cs.CV 2025-10 conditional novelty 7.0

CortexMAE adapts Vision Transformers to fMRI via cortical flat maps, shows power-law scaling on 2.1K hours of data, and outperforms priors on cognitive state decoding while failing to beat a simple functional connecti...
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
cs.LG 2025-10 unverdicted novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
On the Convergence of Muon and Beyond
cs.LG 2025-09 unverdicted novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
IAFormer: Interaction-Aware Transformer network for collider data analysis
hep-ph 2025-05 unverdicted novelty 7.0

IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an o...
Exact Sequence Interpolation with Transformers
cs.LG 2025-02 conditional novelty 7.0

Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs
cs.LG 2024-10 unverdicted novelty 7.0

UQ4CT integrates functional-level uncertainty calibration into mixture-of-experts LoRA fine-tuning via a dedicated loss, cutting expected calibration error by over 25% on multiple-choice and generative QA tasks.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Scaling and evaluating sparse autoencoders
cs.LG 2024-06 unverdicted novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
cs.LG 2024-02 unverdicted novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
cs.CL 2023-11 unverdicted novelty 7.0

LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
A decoder-only foundation model for time-series forecasting
cs.CL 2023-10 unverdicted novelty 7.0

A pretrained decoder-only patched transformer achieves near state-of-the-art zero-shot forecasting performance across diverse time series datasets and settings.