hub Canonical reference

An Empirical Model of Large-Batch Training

Sam McCandlish, Jared Kaplan, Dario Amodei, OpenAI Dota Team · 2018 · cs.LG · arXiv 1812.06162

Canonical reference. 100% of citing Pith papers cite this work as background.

31 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 31 citing papers arXiv PDF

abstract

In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

cs.AI · 2026-03-11 · conditional · novelty 7.0

PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.

How does the optimizer implicitly bias the model merging loss landscape?

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

cs.LG · 2019-10-04 · accept · novelty 7.0

ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.

HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Scaling Laws for Autoregressive Generative Modeling

cs.LG · 2020-10-28 · accept · novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

Dota 2 with Large Scale Deep Reinforcement Learning

cs.LG · 2019-12-13 · accept · novelty 7.0

OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

cs.LG · 2026-03-30 · conditional · novelty 6.0

HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.

Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

cs.LG · 2026-03-18 · unverdicted · novelty 6.0

Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

Predicting Large Model Test Losses with a Noisy Quadratic System

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

Quantum Tilted Loss in Variational Optimization: Theory and Applications

quant-ph · 2026-05-04 · unverdicted · novelty 6.0

QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

cs.DC · 2026-04-29 · unverdicted · novelty 6.0

COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

cs.CL · 2025-05-10 · conditional · novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.

Intelligence Inertia: Physical Isomorphism and Applications

cs.AI · 2026-03-22 · unverdicted · novelty 5.0

Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.

citing papers explorer

Showing 31 of 31 citing papers.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 126 · internal anchor
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization cs.LG · 2026-05-13 · unverdicted · none · ref 69 · internal anchor
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence cs.AI · 2026-03-11 · conditional · none · ref 7 · internal anchor
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
How does the optimizer implicitly bias the model merging loss landscape? cs.LG · 2025-10-06 · unverdicted · none · ref 7 · internal anchor
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models cs.LG · 2019-10-04 · accept · none · ref 8 · internal anchor
ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 27
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 155
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Scaling Laws for Autoregressive Generative Modeling cs.LG · 2020-10-28 · accept · none · ref 17
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
Dota 2 with Large Scale Deep Reinforcement Learning cs.LG · 2019-12-13 · accept · none · ref 30
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 14 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers cs.LG · 2026-03-18 · unverdicted · none · ref 24 · internal anchor
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 240 · internal anchor
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 92 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Predicting Large Model Test Losses with a Noisy Quadratic System cs.LG · 2026-05-09 · unverdicted · none · ref 21
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization stat.ML · 2026-05-07 · unverdicted · none · ref 23
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
Quantum Tilted Loss in Variational Optimization: Theory and Applications quant-ph · 2026-05-04 · unverdicted · none · ref 73
QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training cs.DC · 2026-04-29 · unverdicted · none · ref 29
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding cs.LG · 2026-04-23 · unverdicted · none · ref 66
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free cs.CL · 2025-05-10 · conditional · none · ref 18
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 182
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 95
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 124
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
GNMR: Runtime Stability Control for Low-Precision Large Language Model Training cs.LG · 2026-05-30 · unverdicted · none · ref 34 · internal anchor
GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.
Intelligence Inertia: Physical Isomorphism and Applications cs.AI · 2026-03-22 · unverdicted · none · ref 73 · internal anchor
Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
GradPower: Powering Gradients for Faster Language Model Pre-Training cs.LG · 2025-05-30 · unverdicted · none · ref 12 · internal anchor
GradPower applies sign-power to gradients before optimization and achieves lower terminal loss in language model pre-training across architectures, scales, datasets, and schedules.
Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence cs.AI · 2026-05-16 · unverdicted · none · ref 11 · internal anchor
Proposes Artificial Adaptive Intelligence as the regime between narrow and general AI, defined by elimination of human-specified hyperparameters, and introduces an adaptivity index plus parametric minimality principle grounded in minimum description length.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 157
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 107 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 84
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
Scalable Reinforcement Learning via Adaptive Batch Scaling stat.ML · 2026-05-20 · unreviewed · ref 13 · internal anchor
Hierarchical Fault Detection and Diagnosis for Transformer Architectures cs.SE · 2026-04-30 · unreviewed · ref 20

An Empirical Model of Large-Batch Training

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer