Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
hub Canonical reference
An Empirical Model of Large-Batch Training
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.
hub tools
citation-role summary
citation-polarity summary
roles
background 7polarities
background 7representative citing papers
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.96 and Macro-F1 0.85, plus improved developer repair accuracy in a user study.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
citing papers explorer
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.96 and Macro-F1 0.85, plus improved developer repair accuracy in a user study.