Training Compute-Optimal Large Language Models
Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3
The pith
For compute-optimal LLM training, scale model size and training tokens equally.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
What carries the argument
The scaling relation in which the compute-optimal model size N and token count D grow in proportion to each other as the compute budget increases (each roughly as the square root of compute), derived from fitting loss as a function of N and D.
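As a rough illustration of this relation, the sketch below assumes an additive power-law loss L(N, D) = E + A/N^α + B/D^β and the approximation C ≈ 6ND, then searches along the fixed-compute line for the N that minimizes the loss. The coefficient values are hypothetical placeholders chosen so that α = β; they are not the paper's fitted values, and the function names are illustrative only.

```python
import math

# Hypothetical coefficients for illustration only (not the paper's fitted values).
E, A, B = 1.7, 400.0, 400.0
ALPHA, BETA = 0.3, 0.3


def loss(n_params: float, n_tokens: float) -> float:
    """Assumed parametric loss: a constant plus one power-law term each for N and D."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def best_allocation(compute: float, steps: int = 4001) -> tuple[float, float]:
    """Grid-search N on a log scale along the iso-compute line D = C / (6 N)."""
    best_n, best_loss = 0.0, math.inf
    for i in range(steps):
        n = 10 ** (6 + 6 * i / (steps - 1))  # N from 1e6 to 1e12
        d = compute / (6 * n)
        current = loss(n, d)
        if current < best_loss:
            best_n, best_loss = n, current
    return best_n, compute / (6 * best_n)


n1, d1 = best_allocation(1e21)
n2, d2 = best_allocation(4e21)
print(f"C=1e21: N={n1:.3g}, D={d1:.3g}")
print(f"C=4e21: N={n2:.3g}, D={d2:.3g}")
print(f"N ratio={n2 / n1:.2f}, D ratio={d2 / d1:.2f}")
```

With equal exponents, quadrupling the budget roughly doubles both the best N and the best D on this grid, which is the equal-scaling behaviour described above. Unequal exponents would tilt the split toward one side.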
If this is right
- With 70B parameters, Chinchilla achieves higher accuracy on benchmarks such as MMLU than larger models, including Gopher (280B), which was trained with the same compute budget.
- Smaller optimal models reduce the compute needed for fine-tuning and inference.
- Future training runs should increase data proportionally to model size rather than fixing data size.
- Undertrained models can be improved by adding more data instead of just more parameters.
Where Pith is reading between the lines
- Data collection efforts will need to grow in tandem with model scaling to maintain optimal performance.
- Similar optimal scaling ratios may apply to other domains like vision or multimodal models.
- This challenges the prior trend of ever-larger models trained on fixed amounts of data.
Load-bearing premise
The parametric form of the scaling law fitted to models up to 16B parameters and 500B tokens holds for larger scales.
What would settle it
Training a model at a larger compute budget using the equal-scaling prediction and finding that its loss or downstream performance is worse than a model using a different N-to-D ratio.
Original abstract
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the optimal allocation of compute between model size (N) and training tokens (D) for transformer language models. By training over 400 models spanning 70M to 16B parameters and 5B to 500B tokens, the authors fit a parametric scaling law for validation loss and conclude that compute-optimal training requires scaling N and D equally. They validate the prediction by training Chinchilla (70B parameters, ~1.4T tokens) under the same compute budget as Gopher (280B parameters, 300B tokens); Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a wide range of downstream tasks, including achieving 67.5% average accuracy on MMLU.
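The "same compute budget" statement can be sanity-checked with the common approximation C ≈ 6ND. The sketch below plugs in the rounded parameter and token counts quoted in the summary; the helper name is illustrative, and because the token counts are approximate, the result is only a rough comparison.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ≈ 6·N·D rule of thumb."""
    return 6.0 * n_params * n_tokens


# Rounded figures as quoted in the summary above.
gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)

print(f"Gopher:     {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens per parameter")
print(f"Chinchilla: {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.1f} tokens per parameter")
print(f"Chinchilla / Gopher compute: {chinchilla / gopher:.2f}")
```

Under these rounded figures the two budgets agree to within about 17%, while Chinchilla sees roughly 20 tokens per parameter against roughly 1 for Gopher.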
Significance. If the derived scaling laws and equal N/D scaling hold, the work is highly significant: it provides empirical evidence that many recent large models are undertrained and demonstrates a practical method for more efficient compute allocation that reduces inference and fine-tuning costs. The strength lies in the scale of the experimental sweep (>400 models) combined with direct out-of-sample validation via the Chinchilla training run, which moves the claim beyond pure curve-fitting.
major comments (1)
- [Scaling Laws section (around the derivation of optimal N and D)] The scaling-law fit is performed on models up to 16B parameters; the Chinchilla prediction extrapolates both in N (to 70B) and in D (to 1.4T tokens). While the successful Chinchilla run provides supporting evidence, the manuscript should quantify the uncertainty in the predicted optimum arising from variance in the fitted coefficients (A, B, α, β) and discuss whether alternative functional forms for L(N,D) would materially change the equal-scaling conclusion.
minor comments (2)
- [Abstract] Abstract contains a typographical error: '4× more more data' should be '4× more data'.
- [Figures 2–5 and associated text] Several figures (e.g., loss-vs-compute curves and downstream-task comparisons) would benefit from explicit error bars or shaded uncertainty regions to convey variability across the 400-model sweep.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work and for the constructive suggestion regarding uncertainty quantification. We address the major comment below and will incorporate the requested analysis in the revised manuscript.
-
Referee: [Scaling Laws section (around the derivation of optimal N and D)] The scaling-law fit is performed on models up to 16B parameters; the Chinchilla prediction extrapolates both in N (to 70B) and in D (to 1.4T tokens). While the successful Chinchilla run provides supporting evidence, the manuscript should quantify the uncertainty in the predicted optimum arising from variance in the fitted coefficients (A, B, α, β) and discuss whether alternative functional forms for L(N,D) would materially change the equal-scaling conclusion.
Authors: We agree that an explicit quantification of uncertainty in the extrapolated optimum would strengthen the presentation. In the revised manuscript we will add a short subsection to the Scaling Laws section that reports bootstrap confidence intervals on the fitted coefficients A, B, α, and β (obtained by resampling the >400 training runs with replacement). These intervals will be propagated through the closed-form expression for the optimal N*(C) and D*(C) to give a range of plausible optima at the compute budget used for Chinchilla. Regarding alternative functional forms, we will include a brief discussion showing that the equal-scaling conclusion is robust: the optimum arises from balancing the two power-law terms, so modest changes to the exponents or the use of a multiplicative interaction term leave the scaling exponents for N and D with respect to compute essentially unchanged (both remain close to 0.5). The successful Chinchilla training run, which lies well outside the fitted regime, already provides direct empirical support for the predicted allocation.
Revision: yes
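The resampling procedure described in the response could look roughly like the sketch below. It is a deliberate simplification: instead of refitting all four coefficients, it bootstraps a single scaling exponent from a log-log least-squares fit, and it runs on synthetic data generated with a known exponent of 0.5. None of the numbers are real results, and the function name is hypothetical.

```python
import math
import random

rng = random.Random(0)

# Synthetic stand-in for the sweep: (compute, compute-optimal N) pairs drawn
# from N = k * C**0.5 with log-normal noise. These are not real measurements.
runs = []
for i in range(40):
    c = 10 ** (18 + 4 * i / 39)
    n_opt = 0.1 * c**0.5 * math.exp(rng.gauss(0.0, 0.1))
    runs.append((c, n_opt))


def fit_exponent(sample):
    """Least-squares slope of log N against log C."""
    xs = [math.log(c) for c, _ in sample]
    ys = [math.log(n) for _, n in sample]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


point = fit_exponent(runs)

# Resample the runs with replacement and refit each time.
boot = sorted(
    fit_exponent([rng.choice(runs) for _ in runs]) for _ in range(1000)
)
lo, hi = boot[24], boot[974]  # approximate 2.5th and 97.5th percentiles
print(f"exponent = {point:.3f}, 95% CI ≈ [{lo:.3f}, {hi:.3f}]")
```

A full version would refit the nonlinear loss on every resample and push each refitted coefficient set through the expressions for N*(C) and D*(C). The percentile interval used here is the simplest choice and is only one of several bootstrap interval constructions.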
Circularity Check
No significant circularity; out-of-sample validation confirms scaling prediction
The derivation fits a parametric loss function L(N, D) to empirical results from over 400 models (70M–16B parameters, 5B–500B tokens), derives the compute-optimal relation by minimizing the fitted loss under the constraint C ≈ 6ND, and directly tests the resulting prediction by training Chinchilla (70B parameters, ~1.4T tokens) under Gopher's compute budget. Chinchilla's superior downstream performance is an independent out-of-sample test that the prediction could have failed. No step reduces to self-definition, fitted-input renaming, or load-bearing self-citation; the functional form and optimum are externally validated rather than tautological.
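The minimization step can be written out under the assumption that the fitted loss has the additive power-law form below, using the coefficients A, B, α, β named in the referee report plus a constant offset E that the review text does not name. This is a sketch of the standard constrained-optimization argument, not a transcription of the paper's own derivation.

```latex
\begin{aligned}
L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND \\
\frac{\partial}{\partial N}\, L\!\left(N, \tfrac{C}{6N}\right)
  &= -\frac{\alpha A}{N^{\alpha+1}} + \frac{\beta B}{N}\left(\frac{6N}{C}\right)^{\beta} = 0 \\
N^{*}(C) &= G \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad
D^{*}(C) = G^{-1} \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
\end{aligned}
```

When α and β are close to each other, both exponents are close to 0.5, so N* and D* grow at nearly the same rate with compute. The offset E drops out of the optimum because it does not depend on N or D.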
Axiom & Free-Parameter Ledger
free parameters (1)
- Scaling law coefficients
axioms (1)
- Domain assumption: loss follows a power-law dependence on model size N and data D
Forward citations
Cited by 60 Pith papers
-
An Open-Source Training Dataset for Foundation Models for Black-box Optimization
BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
-
The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets
Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching struc...
-
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders
BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
-
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
Teaching Models to Express Their Uncertainty in Words
GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
-
LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...
-
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
-
Do Language Models Align with Brains? Prediction Scores Are Not Enough
Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
-
Active Testing of Large Language Models via Approximate Neyman Allocation
Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings vers...
-
How Much is Brain Data Worth for Machine Learning?
Brain data is worth a variable number of task samples depending on task-brain alignment, noise levels, and latent dimension, with conditions under which it also improves robustness to test distribution shift.
-
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
-
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
-
On the Invariance and Generality of Neural Scaling Laws
Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...
-
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
-
The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence
Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...
-
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
-
CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation
CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...
-
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...
-
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
Causal inference for social network formation
Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
-
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
-
How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
LLMs and VLMs encode viewpoint information in hidden states but fail to bind it to corresponding observations, resulting in hallucinations in final layers on text-only viewpoint rotation tasks.
-
STORM: End-to-End Referring Multi-Object Tracking in Videos
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
-
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
-
The Shrinking Lifespan of LLMs in Science
LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
-
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
-
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
-
Latent Generative Solvers for Generalizable Long-Term Physics Simulation
LGS pretrained on 2.5M trajectories across 16 systems matches deterministic baselines at one step and halves 20-step error while using far less compute and adapting to held-out higher-resolution flows.
-
OmniMol: Transferring Particle Physics Knowledge to Molecular Dynamics with Point-Edge Transformers
OmniMol transfers a billion-jet pre-trained PET foundation model from HEP to molecular dynamics via an interaction-matrix attention bias, delivering strong performance on the oMol dataset with minimal fine-tuning and ...
-
Scaling Latent Reasoning via Looped Language Models
Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
-
The Art of Scaling Reinforcement Learning Compute for LLMs
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale prediction...
-
Scaling Vision Transformers for Functional MRI with Flat Maps
CortexMAE adapts Vision Transformers to fMRI via cortical flat maps, shows power-law scaling on 2.1K hours of data, and outperforms priors on cognitive state decoding while failing to beat a simple functional connecti...
-
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
-
On the Convergence of Muon and Beyond
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
-
IAFormer: Interaction-Aware Transformer network for collider data analysis
IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an o...
-
Exact Sequence Interpolation with Transformers
Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
-
Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs
UQ4CT integrates functional-level uncertainty calibration into mixture-of-experts LoRA fine-tuning via a dedicated loss, cutting expected calibration error by over 25% on multiple-choice and generative QA tasks.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
-
A decoder-only foundation model for time-series forecasting
A pretrained decoder-only patched transformer achieves near state-of-the-art zero-shot forecasting performance across diverse time series datasets and settings.