Scaling Laws for Neural Language Models

Alec Radford; Benjamin Chess; Dario Amodei; Jared Kaplan; Jeffrey Wu; Rewon Child; Sam McCandlish; Scott Gray; Tom B. Brown; Tom Henighan

arXiv: 2001.08361 · v1 · submitted 2020-01-23 · cs.LG · stat.ML

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Pith reviewed 2026-05-24 15:29 UTC · model grok-4.3

keywords: scaling laws, language models, power-law scaling, compute efficiency, model size, dataset size, overfitting, training dynamics

The pith

Neural language model loss scales as a power law with model size, data size, and training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how cross-entropy loss changes when the number of parameters, the number of training tokens, and the total floating-point operations are varied over many orders of magnitude. It finds simple power-law relations that describe the loss in each case, with only weak additional dependence on network depth or width. These relations also yield equations for how quickly a model overfits its data and how fast training proceeds for a given model size. The relations in turn predict the model size and data volume that minimize loss for any fixed compute budget.

Core claim

Cross-entropy loss L follows power-law scaling in model size N, dataset size D, and compute C, with the functional forms L(N) ~ N^(-α), L(D) ~ D^(-β), and L(C) ~ C^(-γ) holding across more than seven orders of magnitude; architectural details such as width and depth exert only minimal influence inside wide ranges, and the same relations determine optimal compute allocation, sample efficiency, and the point at which training should stop.

What carries the argument

Empirical power-law fits that relate loss directly to model size, dataset size, and compute.
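
As a rough illustration of what such a fit involves, the sketch below fits loss = (x_c / x)^alpha by ordinary least squares in log-log space. It is a minimal sketch, not the paper's fitting procedure: the function name `fit_power_law` is hypothetical, and the data are synthetic and noise-free, generated from the constants quoted elsewhere on this page (alpha_N ~ 0.076, N_c ~ 8.8×10^13).

```python
import math

def fit_power_law(xs, losses):
    """Least-squares fit of loss = (x_c / x) ** alpha in log-log space.

    Taking logs gives log(loss) = alpha * log(x_c) - alpha * log(x),
    a straight line whose slope is -alpha.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)  # intercept = alpha * log(x_c)
    return alpha, x_c

# Synthetic data only (assumption): model sizes from 1e5 to 1e9 parameters,
# with losses generated from the constants quoted on this page.
sizes = [10 ** k for k in range(5, 10)]
losses = [(8.8e13 / n) ** 0.076 for n in sizes]

alpha_hat, n_c_hat = fit_power_law(sizes, losses)
print(f"alpha = {alpha_hat:.3f}, N_c = {n_c_hat:.2e}")
```

On real measurements the residuals of this straight-line fit are what indicate whether a pure power law is adequate, which is the point the referee comments further down turn on.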

If this is right

For any fixed compute budget the lowest loss is achieved by training a very large model on a relatively small dataset and stopping well before convergence.
Larger models require fewer training examples to reach a given loss level.
The amount of overfitting is governed by a simple function of model size and dataset size.
Training speed itself follows a predictable dependence on model size alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scaling relations could be tested on non-language tasks to check whether the exponents are domain-specific.
If the laws remain accurate at still larger scales they would let researchers forecast the loss of a model before any training begins.
The preference for large models on modest data shifts the economic trade-off between hardware and data collection.

Load-bearing premise

The power-law trends measured inside the tested range of sizes will continue to hold when models and datasets grow much larger.

What would settle it

Training a model whose parameter count lies an order of magnitude beyond the largest model studied and finding that its achieved loss lies well outside the band predicted by the fitted power laws.

Figures

Figures reproduced from arXiv: 2001.08361 by Alec Radford, Benjamin Chess, Dario Amodei, Jared Kaplan, Jeffrey Wu, Rewon Child, Sam McCandlish, Scott Gray, Tom B. Brown, Tom Henighan.

Original abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper measures power-law scaling of LM loss with N, D, and C over wide ranges and derives compute-optimal allocation rules from the fits.

The main thing to know is that the authors ran a large set of Transformer runs and found clean power-law relationships: loss drops as N to the -0.076, D to the -0.103, and C to the -0.05, with the trends holding across seven orders of magnitude in the measured regime. From those fits they derive the optimal N*(C) and D*(C) that minimize loss for a fixed compute budget, showing that larger models trained on less data and stopped early are more efficient than smaller models trained longer on more data. That allocation rule is the practical takeaway most people remember.

What the work does well is the experimental scope. They vary model size, data size, and compute systematically, include enough points to fit the exponents reliably, and check that width and depth changes inside the same family do not move the curves much. The figures show the fits are tight once you are above the very smallest models.

The soft spots are exactly where the stress-test note flags them. The optimal-allocation formulas come directly from the fitted exponents, so they are not independent predictions; they will shift if the functional form changes at larger scales. The paper itself notes small deviations at low N and D, and all the data come from one model family, so claims about architectural invariance are limited to that family. Extrapolation beyond the largest measured point (roughly 10^9 parameters and 10^23 FLOPs) is an assumption, not a result. Readers who need to plan runs at 10-100x larger budgets should treat the numbers as a starting point rather than a guarantee.

This paper is for groups that train or study large language models and want quantitative guidance on budget allocation. It is worth a serious referee because the empirical coverage is substantial and the measurements are reproducible enough to check. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The manuscript reports empirical scaling laws showing that cross-entropy loss for neural language models follows power-law dependence on model size N, dataset size D, and compute C, with some trends spanning more than seven orders of magnitude. Architectural details such as width or depth have minimal effects within the tested range. Simple equations describe overfitting and training speed, which are used to derive optimal allocation of a fixed compute budget, favoring training of very large models on modest data and stopping before convergence.

Significance. If the observed power laws and derived allocations hold, the work supplies a quantitative basis for predicting performance and optimizing training efficiency across scales, with the broad empirical coverage (N up to ~10^9, C up to ~10^23 FLOPs) constituting a clear strength for guiding resource allocation in large-model development.

major comments (2)

[§6, Eq. (6.3)–(6.5)] The optimal N*(C) and D*(C) are obtained by minimizing the fitted loss L(N,D) using the exponents reported in §3–4 (e.g., N^{-0.076}, D^{-0.103}, C^{-0.050}). Because these formulas are applied to budgets 10–100× beyond the measured range, the central claim that 'optimally compute-efficient training involves training very large models on a relatively modest amount of data' requires explicit bounds or sensitivity analysis on how deviations from power-law behavior (noted at low N/D) or a change of regime would shift the predicted minimum.
[§3–4] The power-law fits are reported to be good within the observed range, yet the manuscript notes small deviations at low N/D. The load-bearing step of extrapolating these same functional forms to derive the compute-efficiency optimum in §6 would be strengthened by a quantitative propagation of fit residuals or by hold-out validation at the largest scales tested.

minor comments (1)

The notation for the loss function and the precise definition of compute C should be introduced with an equation number in the main text before the scaling plots are presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the empirical coverage and for the constructive comments on extrapolation. We respond to each major comment below.

Referee: [§6, Eq. (6.3)–(6.5)] The optimal N*(C) and D*(C) are obtained by minimizing the fitted loss L(N,D) using the exponents reported in §3–4 (e.g., N^{-0.076}, D^{-0.103}, C^{-0.050}). Because these formulas are applied to budgets 10–100× beyond the measured range, the central claim that 'optimally compute-efficient training involves training very large models on a relatively modest amount of data' requires explicit bounds or sensitivity analysis on how deviations from power-law behavior (noted at low N/D) or a change of regime would shift the predicted minimum.

Authors: We agree that the central claim in §6 rests on extrapolation. The noted deviations from power-law scaling occur at low N/D; the derived optima lie well outside that regime. In the revised manuscript we will add an explicit sensitivity analysis that varies the fitted exponents within their reported uncertainties and recomputes N*(C) and D*(C) to quantify how the location of the minimum shifts. Revision: yes.

Referee: [§3–4] The power-law fits are reported to be good within the observed range, yet the manuscript notes small deviations at low N/D. The load-bearing step of extrapolating these same functional forms to derive the compute-efficiency optimum in §6 would be strengthened by a quantitative propagation of fit residuals or by hold-out validation at the largest scales tested.

Authors: The manuscript already reports fit quality metrics and residuals for the power-law regimes in §3–4. To further support the extrapolation step, the revised version will include a quantitative propagation of the fit residuals into the uncertainty of the derived N*(C) and D*(C) curves. Revision: yes.
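
One way the promised sensitivity analysis could look is sketched below. It is a minimal sketch under stated assumptions, not the authors' analysis: it uses the joint form L(N,D) = [(N_c/N)^(α_N/α_D) + D_c/D]^α_D quoted on this page, imposes D = C / (6N), and finds the best N by grid search. The exponent ranges (±0.004 and ±0.005), the budget of 10^21 FLOPs, and the value of D_c are hypothetical placeholders; N_c is the single-variable value quoted on this page and is reused here only for illustration. The function names are also hypothetical.

```python
import math

def loss_nd(n, d, alpha_n, alpha_d, n_c, d_c):
    """Joint loss form quoted on this page."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

def best_model_size(c, alpha_n, alpha_d, n_c=8.8e13, d_c=1.0e13, points=4001):
    """Grid-search the N that minimizes loss when D = C / (6 N).

    d_c is a hypothetical placeholder; the grid spans N from 1e4 to 1e14.
    """
    best_n, best_loss = None, float("inf")
    for i in range(points):
        n = 10 ** (4 + 10 * i / (points - 1))
        loss = loss_nd(n, c / (6 * n), alpha_n, alpha_d, n_c, d_c)
        if loss < best_loss:
            best_n, best_loss = n, loss
    return best_n

BUDGET = 1e21  # hypothetical compute budget in FLOPs

# Perturb each exponent around the quoted central value (ranges are assumed).
for alpha_n in (0.072, 0.076, 0.080):
    for alpha_d in (0.098, 0.103, 0.108):
        n_star = best_model_size(BUDGET, alpha_n, alpha_d)
        print(f"alpha_N={alpha_n:.3f} alpha_D={alpha_d:.3f} -> N* ~ {n_star:.2e}")
```

The spread of N* across the nine combinations gives a first impression of how strongly the recommended model size depends on the fitted exponents; a fuller treatment would sample the exponents jointly from their fitted uncertainty rather than from a fixed grid.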

Circularity Check

0 steps flagged

Empirical scaling laws from direct experimental fits; optimal allocation is a derived consequence, not a reduction to inputs.

The paper reports direct experimental measurements of cross-entropy loss across model sizes N, dataset sizes D, and compute C (spanning >7 orders of magnitude), then fits power-law forms L(N), L(D), and L(C) to those data points in sections 3-4. The optimal allocation rules in section 6 are obtained by analytically minimizing the fitted functional forms subject to a compute constraint C = 6ND; this is a straightforward mathematical consequence of the empirical fits rather than a self-definitional loop or a 'prediction' that is statistically forced to match the input data. No self-citations, imported uniqueness theorems, or ansatzes are invoked to justify the central claims. The work is therefore self-contained against its own experimental benchmarks within the measured regime.
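
The minimization described in this rationale can be written out explicitly. What follows is a sketch under simplifying assumptions, not the paper's own derivation: it takes the joint form L(N,D) quoted on this page, imposes the constraint C = 6ND exactly, and treats N as continuous. The shorthand a = α_N/α_D is introduced here for convenience.

```latex
\begin{aligned}
L(N, D) &= \left[\left(\frac{N_c}{N}\right)^{a} + \frac{D_c}{D}\right]^{\alpha_D},
  \qquad a \equiv \frac{\alpha_N}{\alpha_D}, \qquad D = \frac{C}{6N} \\
f(N) &= \left(\frac{N_c}{N}\right)^{a} + \frac{6 D_c N}{C}
  \quad \text{(minimizing $f$ minimizes $L$, since $x \mapsto x^{\alpha_D}$ is increasing)} \\
f'(N) &= -a\, N_c^{a}\, N^{-a-1} + \frac{6 D_c}{C} = 0 \\
N^*(C) &= \left(\frac{a\, N_c^{a}\, C}{6 D_c}\right)^{1/(1+a)} \propto C^{1/(1+a)},
  \qquad D^*(C) = \frac{C}{6 N^*(C)} \propto C^{a/(1+a)}
\end{aligned}
```

With the exponents quoted in the referee report (0.076 and 0.103), a is about 0.74, so this simplified calculation has N* growing roughly as C^0.58 and D* as C^0.42. Because it uses only the constraint as summarized here, these exponents should be read as a consequence of the sketch's assumptions rather than as the paper's reported values.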

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 0 invented entities

The paper contributes fitted scaling exponents and derived allocation rules; the power-law functional form and separability of size/data/compute effects are assumptions fitted to data rather than derived from first principles.

free parameters (4)

power-law exponent for model size: fitted from empirical loss versus parameter count curves
power-law exponent for dataset size: fitted from empirical loss versus data volume curves
power-law exponent for compute: fitted from empirical loss versus total compute curves
scaling equation coefficients: multiple constants fitted to match observed loss values across experiments

axioms (2)

domain assumption: Loss follows a power-law functional form in model size, data, and compute. Chosen because it fits the observed empirical trends over the tested range.
domain assumption: Architectural details such as width and depth have minimal effects within wide ranges. Invoked to attribute scaling primarily to size, data, and compute.

pith-pipeline@v0.9.0 · 5661 in / 1503 out tokens · 65535 ms · 2026-05-24T15:29:24.613175+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

The loss scales as a power-law with model size, dataset size, and the amount of compute... L(N)=(N_c/N)^α_N ; α_N∼0.076, N_c∼8.8×10^13
IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

L(N,D)=[(N_c/N)^(α_N/α_D)+D_c/D]^α_D
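
To make the two quoted formulas concrete, the sketch below evaluates L(N) and L(N,D) and reports how much the finite-data term raises the loss. It is a minimal sketch: α_N, α_D, and N_c are the values quoted on this page, while the value used for D_c is a hypothetical placeholder chosen only so the example runs, and the helper names are not from the paper.

```python
N_C = 8.8e13               # quoted on this page for L(N)
D_C_HYPOTHETICAL = 1.0e13  # placeholder (assumption), not a fitted value
ALPHA_N = 0.076
ALPHA_D = 0.103

def loss_n(n, n_c=N_C, alpha_n=ALPHA_N):
    """L(N) = (N_c / N) ** alpha_N, the data-unlimited form."""
    return (n_c / n) ** alpha_n

def loss_nd(n, d, n_c=N_C, d_c=D_C_HYPOTHETICAL,
            alpha_n=ALPHA_N, alpha_d=ALPHA_D):
    """L(N, D) = [(N_c / N) ** (alpha_N / alpha_D) + D_c / D] ** alpha_D."""
    return ((n_c / n) ** (alpha_n / alpha_d) + d_c / d) ** alpha_d

def overfit_penalty(n, d):
    """Fractional increase in loss caused by training on only d tokens."""
    return loss_nd(n, d) / loss_n(n) - 1.0

# For a fixed model size, the penalty shrinks as the dataset grows.
for tokens in (1e9, 1e10, 1e11, 1e12):
    print(f"D = {tokens:.0e}: penalty = {overfit_penalty(1e9, tokens):.3f}")
```

As D grows without bound the D_c/D term vanishes and L(N,D) reduces to L(N), which is why the same exponent α_N appears in both expressions.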

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
An Open-Source Training Dataset for Foundation Models for Black-box Optimization
cs.LG 2026-05 unverdicted novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets
econ.GN 2026-05 unverdicted novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching struc...
Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
cs.LG 2026-05 unverdicted novelty 8.0

Fixed tokens-per-parameter ratios in scaling law experiments induce ill-conditioned least-squares fits due to Jacobian geometry, making scale coefficients unidentifiable and extrapolations unreliable; diverse TPP cove...
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters
quant-ph 2026-05 unverdicted novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
Nearly Optimal Attention Coresets
cs.DS 2026-05 unverdicted novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
Efficient Training on Multiple Consumer GPUs with RoundPipe
cs.DC 2026-04 conditional novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
cs.AI 2026-04 unverdicted novelty 8.0

Masking-based explanations are governed by the information capacity of the query channel, with reliable recovery achievable below capacity via sparse maximum-likelihood decoding but impossible above it.
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
cs.LG 2026-04 unverdicted novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
cs.LG 2026-02 unverdicted novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.
Evaluating Large Language Models in Scientific Discovery
cs.AI 2025-12 unverdicted novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods
cs.LG 2025-06 unverdicted novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States
cs.LG 2025-05 unverdicted novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
cs.CL 2024-10 unverdicted novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
cs.LG 2024-07 conditional novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
KAN: Kolmogorov-Arnold Networks
cs.LG 2024-04 conditional novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
cs.CL 2023-05 conditional novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Discovering Language Model Behaviors with Model-Written Evaluations
cs.CL 2022-12 unverdicted novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL 2020-12 conditional novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
Tokenisation via Convex Relaxations
cs.CL 2026-05 unverdicted novelty 7.0

ConvexTok uses convex relaxation of tokenization to a linear program, improving intrinsic metrics, bits-per-byte, and some downstream tasks while certifying near-optimality within 1% at typical vocabulary sizes.
Forecasting Scientific Progress with Artificial Intelligence
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and in...
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
cs.AI 2026-05 unverdicted novelty 7.0

More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks
stat.ML 2026-05 unverdicted novelty 7.0

Finite-width shallow networks remain within poly(d) m^{-min(1,c/6)} of their mean-field limit uniformly in time when mean-field excess loss decays as t^{-c} under standard regularity and an integral condition on the loss.
Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training
cs.CV 2026-05 unverdicted novelty 7.0

AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
cs.CV 2026-05 conditional novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Provable Joint Decontamination for Benchmarking Multiple Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
cs.LG 2026-05 unverdicted novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
cs.CR 2026-05 conditional novelty 7.0

Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels
eess.SP 2026-05 unverdicted novelty 7.0

PilotWiMAE pretrains an encoder on noisy pilots with factorized attention, 99% masking, patch-normalized reconstruction, scale loss, and AWGN curriculum to outperform supervised baselines in cross-frequency beam selec...
The Economics of AI Inference: Inflation Dynamics, Welfare Costs, and Optimal Monetary Policy under the Inference-Cost Phillips Curve
econ.GN 2026-05 unverdicted novelty 7.0

Develops the Inference-Cost Phillips Curve linking AI inference costs to inflation dynamics, derives structural slopes and optimal monetary policy, and reports empirical estimates from US and G7 data that align with t...
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
cs.DC 2026-05 unverdicted novelty 7.0

JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
cs.LG 2026-05 unverdicted novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC 2026-05 conditional novelty 7.0

Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE
cs.CL 2026-05 unverdicted novelty 7.0

PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
SNLP: Layer-Parallel Inference via Structured Newton Corrections
cs.LG 2026-05 unverdicted novelty 7.0

SNLP enables layer-parallel Transformer inference by replacing sequential layer execution with structured Newton corrections and SNLP-aware training regularization, yielding up to 2.3x wall-clock speedup on 0.5B model...
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment
cs.LG 2026-05 unverdicted novelty 7.0

PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
nlin.AO 2026-05 unverdicted novelty 7.0

LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density
cs.LG 2026-05 unverdicted novelty 7.0

Olivia harmonizes time series datasets via normalized power spectral density using a Harmonizer module and resonator-based HarmonicAttention, achieving state-of-the-art zero-shot, few-shot, and full-shot forecasting o...
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
cs.LG 2026-05 unverdicted novelty 7.0

QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
cs.LG 2026-05 unverdicted novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...
Do Language Models Align with Brains? Prediction Scores Are Not Enough
q-bio.NC 2026-05 unverdicted novelty 7.0

Language model representations fail all L-PACT alignment gates once controls explain the apparent predictive and relational effects.
Scaling Laws for Mixture Pretraining Under Data Constraints
cs.LG 2026-05 conditional novelty 7.0

Repetition-aware scaling laws show scarce target data in pretraining mixtures can be repeated 15-20 times optimally, with the best count depending on data size, compute, and model scale.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
cs.AI 2026-05 unverdicted novelty 7.0

AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
Uniform Scaling Limits in AdamW-Trained Transformers
stat.ML 2026-05 unverdicted novelty 7.0

AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
stat.ML 2026-05 unverdicted novelty 7.0

In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation
cs.SI 2026-05 unverdicted novelty 7.0

GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
cs.CV 2026-05 unverdicted novelty 7.0

Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
cs.AI 2026-05 unverdicted novelty 7.0

Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.
How Much is Brain Data Worth for Machine Learning?
cs.AI 2026-05 conditional novelty 7.0

Brain data is worth a variable number of task samples depending on task-brain alignment, noise levels, and latent dimension, with conditions under which it also improves robustness to test distribution shift.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
cond-mat.dis-nn 2026-05 unverdicted novelty 7.0

A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
cs.LG 2026-05 unverdicted novelty 7.0

Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
On the Invariance and Generality of Neural Scaling Laws
cs.LG 2026-05 unverdicted novelty 7.0

Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
cs.AI 2026-05 unverdicted novelty 7.0

Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 614 Pith papers · 4 internal anchors

[1]

High-dimensional dynamics of generalization error in neural networks

25 [AS17] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv, 2017, 1710.03667. 11, 18, 22 [BB01] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting on association for computational linguistics, page...

work page Pith review Pith/arXiv arXiv 2017
[2]

Proceedings of the National Academy of Sciences , volume =

18 [BHMM18] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv, 2018, 1812.11118. 18 [Bia12] Gérard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(Apr):1063–1095,

work page arXiv 2018
[3]

Generating Long Sequences with Sparse Transformers

18 [CGRS19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019, 1904.10509. URL http://arxiv.org/abs/1904.10509. 19 [DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understandi...

work page internal anchor Pith review Pith/arXiv arXiv 1904
[4]

Gradient Descent Happens in a Tiny Subspace

25 [Fou] The Common Crawl Foundation. Common crawl. URL http://commoncrawl.org. 7 [GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. 2018, arXiv:1812.04754. 18 [GJS+19] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthie...

work page Pith review Pith/arXiv arXiv 2018
[5]

18 [GRK17] Scott Gray, Alec Radford, and Diederik P Kingma

URL http://arxiv.org/abs/cs.CL/0108005. 18 [GRK17] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights. openai.com,

work page arXiv
[6]

Sadayappan

ACM. doi:10.1145/3293883.3295710. 18 28 [HCC+18] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018, 1811.06965. URL http://arxiv.org/abs/1811.06965. 19 [HNA+17] Joel Hestness, Sharan Narang, Newsha ...

work page doi:10.1145/3293883.3295710 2018
[7]

Adam: A Method for Stochastic Optimization

18 [KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014, 1412.6980. 7 [Kom19] Aran Komatsuzaki. One epoch is all you need, 2019, arXiv:1906.06669. 18 [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International C...

work page internal anchor Pith review Pith/arXiv arXiv 2014
[8]

URL http://dl.acm.org/citation.cfm?id=2999134.2999257

Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999134.2999257. 19 [LCG+19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations, 2019, 1909.11942. 9 [LOG+19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi...

work page arXiv 2019
[9]

Wide neural networks of any depth evolve as linear models under gradient descent

25 [LXS+19] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, arXiv:1902.06720. 18 [MKAT18] Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch tra...

work page arXiv 2019
[10]

arXiv preprint arXiv:1909.12673 , year=

2, 6 [RRBS19a] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, 1909.12673. 18 [RRBS19b] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales, 2019, arXiv:1909.12673. 18 ...

work page arXiv 2019
[11]

Mesh-TensorFlow: Deep Learning for Supercomputers

2, 5, 6, 7, 8 [SCP+18] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-tensorflow: Deep learning for supercomputers, 2018, 1811.02084. 19 [SHB15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine t...

work page Pith review Pith/arXiv arXiv 2018
[12]

18 [TL19] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946, 2019, 1905.11946. URL http://arxiv.org/abs/1905.11946. 18 [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. I...

work page internal anchor Pith review Pith/arXiv arXiv 1905
[13]

2, 6 [VWB16] Andreas Veit, Michael Wilber, and Serge Belongie

URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf. 2, 6 [VWB16] Andreas Veit, Michael Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks, 2016, arXiv:1605.06431. 8, 18 [Was06] Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media,

work page Pith/arXiv arXiv 2016
[14]

18 [WPN+19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2019, 1905.00537. 2 [WRH17] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity....

work page internal anchor Pith review Pith/arXiv arXiv 2019
[15]

Growing a Brain: Fine-Tuning by Increasing Model Capacity , Url =

doi:10.1109/cvpr.2017.323. 19 [WYL19] Wei Wen, Feng Yan, and Hai Li. Autogrow: Automatic layer growing in deep convolutional networks, 2019, 1906.02909. 19 [YDY+19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding, 2019, arXiv:1906.08237. ...

work page doi:10.1109/cvpr.2017.323 2017
[16]

Wide residual networks

doi:10.5244/c.30.87. 18 [ZKZ+15] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), Dec

work page doi:10.5244/c.30.87 2015
[17]

2015 IEEE International Conference on Computer Vision (ICCV), 19–27 (2015) https://doi.org/10.1109/iccv.2015.11

doi:10.1109/iccv.2015.11. 7 [ZLN+19] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. CoRR, abs/1907.04164, 2019, 1907.04164. URL http://arxiv.org/abs/1907.04164. 12, 18 30

work page doi:10.1109/iccv.2015.11 2015
