arxiv: 2203.15556 · v1 · submitted 2022-03-29 · 💻 cs.CL · cs.LG

Training Compute-Optimal Large Language Models

Aidan Clark, Arthur Mensch, Aurelia Guy, Bogdan Damoc, Diego de las Casas, Elena Buchatskaya, Eliza Rutherford, Erich Elsen, Eric Noland, George van den Driessche, Jack W. Rae, Johannes Welbl, Jordan Hoffmann, Karen Simonyan, Katie Millican, Laurent Sifre, Lisa Anne Hendricks, Oriol Vinyals, Sebastian Borgeaud, Simon Osindero, Tom Hennigan, Trevor Cai

Pith reviewed 2026-05-10 15:55 UTC · model grok-4.3

keywords: compute optimal training, scaling laws, large language models, transformer, Chinchilla, model size, training tokens

The pith

For compute-optimal LLM training, scale model size and training tokens equally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how best to split a fixed compute budget between the number of model parameters and the amount of training data for transformer language models. From experiments with over 400 models of varying sizes and data volumes, it concludes that current large models are undertrained because recent scaling efforts grew parameter counts while holding training data roughly constant. The central result is that model size and training tokens should be scaled in tandem: every doubling of model size should be matched by a doubling of training tokens, so each grows roughly as the square root of the compute budget. A model trained this way outperforms much larger ones on standard benchmarks while costing less at inference.

Core claim

By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
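The equal-scaling rule can be made concrete with a few lines of arithmetic. The sketch below assumes the common approximation C ≈ 6ND for training FLOPs; the function names and the starting point (1B parameters, 20B tokens) are hypothetical and chosen only for illustration.

```python
import math


def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def scale_allocation(n_params, n_tokens, compute_multiplier):
    """Scale N and D by the same factor so that C grows by compute_multiplier.

    Because C is proportional to N * D, an equal factor on both must be
    the square root of the compute multiplier.
    """
    factor = math.sqrt(compute_multiplier)
    return n_params * factor, n_tokens * factor


# Hypothetical starting allocation, for illustration only.
n0, d0 = 1e9, 20e9

# Quadrupling compute doubles both model size and tokens.
n1, d1 = scale_allocation(n0, d0, 4.0)
print(n1 / n0, d1 / d0)

# Doubling compute multiplies each by sqrt(2), not by 2.
n2, d2 = scale_allocation(n0, d0, 2.0)
print(n2 / n0, d2 / d0)
```

Under this approximation, doubling both N and D corresponds to a fourfold increase in compute, which is why the rule is phrased per doubling of model size rather than per doubling of compute.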

What carries the argument

The scaling relation in which the compute-optimal model size N and token count D grow in proportion to each other as the compute budget increases, derived from fitting loss as a function of N and D.
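A sketch of where this proportionality comes from, assuming the parametric loss form listed in this page's ledger and the approximation C ≈ 6ND quoted in the circularity check. The symbol G is shorthand introduced here; substituting D = C/(6N) and setting the derivative with respect to N to zero gives:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad C \approx 6ND

\frac{\partial}{\partial N}
\left[ \frac{A}{N^{\alpha}} + B \left( \frac{6N}{C} \right)^{\beta} \right] = 0
\;\Longrightarrow\;
N^{\alpha + \beta} = \frac{\alpha A}{\beta B} \left( \frac{C}{6} \right)^{\beta}

N^{*}(C) = G \left( \frac{C}{6} \right)^{\frac{\beta}{\alpha + \beta}},
\qquad
D^{*}(C) = G^{-1} \left( \frac{C}{6} \right)^{\frac{\alpha}{\alpha + \beta}},
\qquad
G = \left( \frac{\alpha A}{\beta B} \right)^{\frac{1}{\alpha + \beta}}
```

When α and β are close to each other, both exponents are close to 0.5, which is the equal-scaling rule.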

If this is right

With 70B parameters, Chinchilla achieves higher accuracy on benchmarks like MMLU than Gopher (280B), which was trained with the same compute budget.
Smaller optimal models reduce the compute needed for fine-tuning and inference.
Future training runs should increase data proportionally to model size rather than fixing data size.
Undertrained models can be improved by adding more data instead of just more parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Data collection efforts will need to grow in tandem with model scaling to maintain optimal performance.
Similar optimal scaling ratios may apply to other domains like vision or multimodal models.
This challenges the prior trend of ever-larger models trained on fixed amounts of data.

Load-bearing premise

The parametric form of the scaling law fitted to models up to 16B parameters and 500B tokens holds for larger scales.

What would settle it

Training a model at a larger compute budget using the equal-scaling prediction and finding that its loss or downstream performance is worse than a model using a different N-to-D ratio.

Original abstract

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that LLMs have been undertrained on data and that equal scaling of parameters and tokens is compute-optimal, backed by a large sweep and a direct validation run.

The main takeaway is that current large models like Gopher are undertrained, and that for a fixed compute budget you get better results by scaling model size and training tokens together rather than mostly growing the model. They trained over 400 models from 70M to 16B parameters on 5B to 500B tokens, fitted a scaling law, and then trained Chinchilla at 70B parameters with roughly 1.4T tokens under the same compute as Gopher. Chinchilla beats Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a wide range of downstream tasks, including a clear gain on MMLU. That out-of-sample validation run is the strongest part of the work and makes the equal-scaling claim credible rather than just a curve fit. The study is useful because it gives a concrete, actionable rule for how to allocate compute when training transformers at this scale. The soft spot is the extrapolation: the law is fitted on models up to 16B and 500B tokens, so moving to 70B and 1.4T tokens assumes the same functional form continues to hold. The successful Chinchilla run reduces the risk, but it does not eliminate uncertainty about whether a different loss curve would shift the optimum. This is the kind of paper that changes how labs set training budgets, so it deserves a serious referee even if some details of the functional form need tightening in revision.

Referee Report

1 major / 2 minor

Summary. The paper investigates the optimal allocation of compute between model size (N) and training tokens (D) for transformer language models. By training over 400 models spanning 70M to 16B parameters and 5B to 500B tokens, the authors fit a parametric scaling law for validation loss and conclude that compute-optimal training requires scaling N and D equally. They validate the prediction by training Chinchilla (70B parameters, ~1.4T tokens) under the same compute budget as Gopher (280B parameters, 300B tokens); Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a wide range of downstream tasks, including achieving 67.5% average accuracy on MMLU.
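The "same compute budget" statement can be sanity-checked from the parameter and token counts quoted above. This is a back-of-the-envelope sketch that assumes the approximation C ≈ 6ND; `train_flops` is a hypothetical helper name, and the result is an estimate rather than the paper's reported FLOP count.

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute, assuming C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


# Parameter and token counts as quoted in the summary above.
gopher = train_flops(280e9, 300e9)
chinchilla = train_flops(70e9, 1.4e12)

print(f"Gopher     ~ {gopher:.2e} FLOPs")
print(f"Chinchilla ~ {chinchilla:.2e} FLOPs")
print(f"ratio      ~ {chinchilla / gopher:.2f}")

# Tokens seen per parameter under each allocation.
print(f"Gopher tokens/param     ~ {300e9 / 280e9:.1f}")
print(f"Chinchilla tokens/param ~ {1.4e12 / 70e9:.1f}")
```

Under this approximation the two budgets come out at roughly 5.0 × 10^23 and 5.9 × 10^23 FLOPs, close enough to be treated as the same budget, while the tokens-per-parameter ratio moves from about 1 to about 20.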

Significance. If the derived scaling laws and equal N/D scaling hold, the work is highly significant: it provides empirical evidence that many recent large models are undertrained and demonstrates a practical method for more efficient compute allocation that reduces inference and fine-tuning costs. The strength lies in the scale of the experimental sweep (>400 models) combined with direct out-of-sample validation via the Chinchilla training run, which moves the claim beyond pure curve-fitting.

major comments (1)

[Scaling Laws section (around the derivation of optimal N and D)] The scaling-law fit is performed on models up to 16B parameters; the Chinchilla prediction extrapolates both in N (to 70B) and in D (to 1.4T tokens). While the successful Chinchilla run provides supporting evidence, the manuscript should quantify the uncertainty in the predicted optimum arising from variance in the fitted coefficients (A, B, α, β) and discuss whether alternative functional forms for L(N,D) would materially change the equal-scaling conclusion.

minor comments (2)

[Abstract] Abstract contains a typographical error: '4× more more data' should be '4× more data'.
[Figures 2–5 and associated text] Several figures (e.g., loss-vs-compute curves and downstream-task comparisons) would benefit from explicit error bars or shaded uncertainty regions to convey variability across the 400-model sweep.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation of the work and for the constructive suggestion regarding uncertainty quantification. We address the major comment below and will incorporate the requested analysis in the revised manuscript.

Referee: [Scaling Laws section (around the derivation of optimal N and D)] The scaling-law fit is performed on models up to 16B parameters; the Chinchilla prediction extrapolates both in N (to 70B) and in D (to 1.4T tokens). While the successful Chinchilla run provides supporting evidence, the manuscript should quantify the uncertainty in the predicted optimum arising from variance in the fitted coefficients (A, B, α, β) and discuss whether alternative functional forms for L(N,D) would materially change the equal-scaling conclusion.

Authors: We agree that an explicit quantification of uncertainty in the extrapolated optimum would strengthen the presentation. In the revised manuscript we will add a short subsection to the Scaling Laws section that reports bootstrap confidence intervals on the fitted coefficients A, B, α, and β (obtained by resampling the >400 training runs with replacement). These intervals will be propagated through the closed-form expression for the optimal N*(C) and D*(C) to give a range of plausible optima at the compute budget used for Chinchilla. Regarding alternative functional forms, we will include a brief discussion showing that the equal-scaling conclusion is robust: the optimum arises from balancing the two power-law terms, so modest changes to the exponents or the use of a multiplicative interaction term leave the scaling exponents for N and D with respect to compute essentially unchanged (both remain close to 0.5). The successful Chinchilla training run, which lies well outside the fitted regime, already provides direct empirical support for the predicted allocation.

Revision: yes
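The propagation step described in the rebuttal could look roughly like the sketch below. It assumes the loss form L(N, D) = E + A/N^α + B/D^β with C ≈ 6ND, and it covers only the propagation, not the bootstrap refits themselves. The `resampled_fits` values are made-up placeholders standing in for refitted coefficients, not numbers from the paper, and the function names are hypothetical.

```python
def optimal_allocation(compute, A, B, alpha, beta):
    """Closed-form minimiser of E + A/N**alpha + B/D**beta with compute = 6*N*D.

    E does not appear because a constant offset does not move the optimum.
    """
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (compute / 6.0) ** (beta / (alpha + beta))
    d_opt = compute / (6.0 * n_opt)
    return n_opt, d_opt


def percentile_interval(values, lower=0.025, upper=0.975):
    """Crude percentile interval: pick the nearest order statistics."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]


# Placeholder (A, B, alpha, beta) tuples standing in for bootstrap refits.
# In a real analysis each tuple would come from refitting on a resample of runs.
resampled_fits = [
    (300.0, 500.0, 0.35, 0.30),
    (310.0, 480.0, 0.36, 0.29),
    (290.0, 520.0, 0.34, 0.31),
    (305.0, 495.0, 0.35, 0.31),
    (295.0, 510.0, 0.36, 0.30),
]

# Budget implied by 70B parameters and 1.4T tokens under C ~ 6*N*D.
compute = 6.0 * 70e9 * 1.4e12

n_values = [optimal_allocation(compute, *fit)[0] for fit in resampled_fits]
n_low, n_high = percentile_interval(n_values)
print(f"N* range across resampled fits: {n_low:.3e} to {n_high:.3e}")
```

With only a handful of placeholder tuples the interval collapses to the minimum and maximum; a real bootstrap would use hundreds or thousands of refits so the percentiles are meaningful.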

Circularity Check

0 steps flagged

No significant circularity; out-of-sample validation confirms scaling prediction

The derivation fits a parametric loss function L(N, D) to empirical results from over 400 models (70M–16B parameters, 5B–500B tokens), derives the compute-optimal relation by minimizing under the constraint C ≈ 6ND, and directly tests the resulting prediction by training Chinchilla (70B parameters, ~1.4T tokens) under Gopher's compute budget. Chinchilla's superior downstream performance is an independent test of the prediction outside the fitting set. No step reduces to self-definition, fitted-input renaming, or load-bearing self-citation; the functional form and optimum are externally validated rather than tautological.
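The minimization step can be checked numerically. The sketch below assumes the loss form L(N, D) = E + A/N^α + B/D^β and compares a brute-force search along the iso-compute curve D = C/(6N) with the closed-form optimum. The coefficients are placeholders chosen for illustration, not the paper's fitted values, and the function names are hypothetical.

```python
import math


def loss(n, d, E, A, B, alpha, beta):
    """Assumed parametric loss: irreducible term plus two power laws."""
    return E + A / n**alpha + B / d**beta


def closed_form_n(compute, A, B, alpha, beta):
    """Optimal N from setting dL/dN = 0 with D = compute / (6 * N)."""
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return g * (compute / 6.0) ** (beta / (alpha + beta))


def grid_search_n(compute, params, lo=1e6, hi=1e13, steps=2000):
    """Brute-force minimum over a log-spaced grid of model sizes."""
    best_n, best_loss = None, math.inf
    for i in range(steps):
        n = lo * (hi / lo) ** (i / (steps - 1))
        d = compute / (6.0 * n)  # stay on the iso-compute curve
        value = loss(n, d, *params)
        if value < best_loss:
            best_n, best_loss = n, value
    return best_n


# Placeholder coefficients (E, A, B, alpha, beta), for illustration only.
params = (2.0, 300.0, 500.0, 0.35, 0.30)
compute = 1e21

n_grid = grid_search_n(compute, params)
n_closed = closed_form_n(compute, *params[1:])
print(f"grid: {n_grid:.3e}  closed form: {n_closed:.3e}")
```

The two values should agree to within the grid spacing, which is a quick way to catch algebra mistakes before relying on the closed form.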

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The optimal allocation is obtained by fitting a parametric loss function L(N,D) to the 400-model results and minimizing under a compute constraint; the functional form and fitted coefficients are the main unverified inputs.

free parameters (1)

Scaling law coefficients
Parameters (A, B, α, β, E) in the assumed loss scaling form L(N, D) = E + A/N^α + B/D^β, fitted to the empirical results of the 400 models.

axioms (1)

Domain assumption: loss follows a power-law dependence on model size N and data D
The functional form used to fit data and derive the equal-scaling optimum.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
cs.LG 2026-04 unverdicted novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Editing Models with Task Arithmetic
cs.LG 2022-12 accept novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL 2026-05 unverdicted novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
stat.ML 2026-05 unverdicted novelty 7.0

In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
How Much is Brain Data Worth for Machine Learning?
cs.AI 2026-05 conditional novelty 7.0

Brain data is worth a variable number of task samples depending on task-brain alignment, noise levels, and latent dimension, with conditions under which it also improves robustness to test distribution shift.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
cs.LG 2026-05 unverdicted novelty 7.0

DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
cond-mat.dis-nn 2026-05 unverdicted novelty 7.0

A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
On the Invariance and Generality of Neural Scaling Laws
cs.LG 2026-05 unverdicted novelty 7.0

Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence
cs.LG 2026-05 unverdicted novelty 7.0

Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural...
Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge
cs.DC 2026-05 unverdicted novelty 7.0

Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
CellxPert: Inference-Time MCMC Steering of a Multi-Omics Single-Cell Foundation Model for In-Silico Perturbation
q-bio.GN 2026-04 unverdicted novelty 7.0

CellxPert uses inference-time MCMC steering on a multi-omics single-cell foundation model to predict genome-wide transcriptomic responses to gene perturbations and outperforms baselines on cell-type annotation, pertur...
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
cs.MA 2026-04 unverdicted novelty 7.0

Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Har...
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
cs.IR 2026-04 unverdicted novelty 7.0

LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
cs.CL 2026-04 unverdicted novelty 7.0

Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Neural Garbage Collection: Learning to Forget while Learning to Reason
cs.LG 2026-04 conditional novelty 7.0

Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
Causal inference for social network formation
econ.EM 2026-04 conditional novelty 7.0

Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
cs.AI 2026-04 unverdicted novelty 7.0

A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study
cs.AI 2026-04 unverdicted novelty 7.0

LLMs and VLMs encode viewpoint information in hidden states but fail to bind it to corresponding observations, resulting in hallucinations in final layers on text-only viewpoint rotation tasks.
STORM: End-to-End Referring Multi-Object Tracking in Videos
cs.CV 2026-04 unverdicted novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.
The Shrinking Lifespan of LLMs in Science
cs.DL 2026-04 unverdicted novelty 7.0

LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
cs.DB 2026-04 unverdicted novelty 7.0

LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
cs.DC 2026-03 unverdicted novelty 7.0

GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
Moshi: a speech-text foundation model for real-time dialogue
eess.AS 2024-09 accept novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV 2024-06 conditional novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Scaling and evaluating sparse autoencoders
cs.LG 2024-06 unverdicted novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
Segment Anything
cs.CV 2023-04 unverdicted novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
Accelerating Large Language Model Decoding with Speculative Sampling
cs.CL 2023-02 accept novelty 7.0

Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV 2022-04 unverdicted novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
q-bio.GN 2026-05 unverdicted novelty 6.0

Set-aggregated genome embeddings from genomic language models predict microbiome abundance profiles with improved generalization to novel genomes over classical bioinformatics methods.
EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models
cs.RO 2026-05 unverdicted novelty 6.0

EvoNav automates the design of reward functions for RL robot navigation by evolving LLM proposals through a three-stage cheap-to-expensive evaluation process and claims better policies than hand-crafted or prior autom...
Active Testing of Large Language Models via Approximate Neyman Allocation
cs.AI 2026-05 unverdicted novelty 6.0

Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.
Annotations Mitigate Post-Training Mode Collapse
cs.CL 2026-05 unverdicted novelty 6.0

Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
Sparse Layers are Critical to Scaling Looped Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
Predicting Large Model Test Losses with a Noisy Quadratic System
cs.LG 2026-05 unverdicted novelty 6.0

A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
The Propagation Field: A Geometric Substrate Theory of Deep Learning
cs.LG 2026-05 unverdicted novelty 6.0

Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...
A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks
cs.LG 2026-05 unverdicted novelty 6.0

Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.
Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

Pretraining induces stable leading singular vectors that form a reusable spectral basis inherited by downstream tasks, enabling competitive performance with 0.2% trainable parameters on GLUE.
An Interpretable and Scalable Framework for Evaluating Large Language Models
stat.ML 2026-05 unverdicted novelty 6.0

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
Target-Aware Data Augmentation for SAT Prediction
cs.LG 2026-05 unverdicted novelty 6.0

A target-aware solver-free data generation pipeline plus an LPGNN that uses linear-programming residuals produces fast, correctly labeled training data and improves GNN-based SAT prediction.
Knowledge Transfer Scaling Laws for 3D Medical Imaging
cs.CV 2026-05 conditional novelty 6.0

Transfer-aware data allocation derived from observed power-law scaling laws for asymmetric knowledge transfer in 3D medical imaging outperforms standard proportional sampling by up to 58% and generalizes to new budgets.
Earth-o1: A Grid-free Observation-native Atmospheric World Model
cs.CV 2026-05 unverdicted novelty 6.0

Earth-o1 learns continuous atmospheric dynamics from ungridded observations and matches operational IFS forecast skill in hindcasts.
Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
cs.LG 2026-05 unverdicted novelty 6.0

FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faste...
On Training in Imagination
cs.LG 2026-05 unverdicted novelty 6.0

The work derives the optimal ratio of dynamics-to-reward samples that minimizes a bound on return error and characterizes the tradeoff between noisy but cheap rewards versus accurate but expensive ones in imagination-...
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
cs.LG 2026-05 unverdicted novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
stat.ML 2026-05 unverdicted novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...