Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov , Dmitrii Podoprikhin , Timur Garipov , Dmitry Vetrov , Andrew Gordon Wilson

classification: cs.LG, cs.AI, cs.CV, stat.ML

keywords: averaging, generalization, networks, better, conventional, leads, learning, rate

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. We also show that this Stochastic Weight Averaging (SWA) procedure finds much flatter solutions than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model. Using SWA we achieve notable improvement in test accuracy over conventional SGD training on a range of state-of-the-art residual networks, PyramidNets, DenseNets, and Shake-Shake networks on CIFAR-10, CIFAR-100, and ImageNet. In short, SWA is extremely easy to implement, improves generalization, and has almost no computational overhead.
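
The averaging step described in the abstract can be illustrated with a minimal sketch. It runs SGD with a constant learning rate on a toy noisy quadratic and keeps a running average of the weights sampled along the trajectory. The function names, the toy objective, and the hyperparameters (`swa_start`, `swa_every`) are assumptions made for illustration and are not taken from the paper.

```python
import random


def noisy_quadratic_grad(w, rng, noise=0.5):
    """Gradient of the toy loss 0.5 * ||w||^2 plus Gaussian noise (hypothetical objective)."""
    return [wi + rng.gauss(0.0, noise) for wi in w]


def sgd_with_swa(grad_fn, w0, lr=0.1, steps=200, swa_start=100, swa_every=10, seed=0):
    """Constant-learning-rate SGD that also averages weights sampled along the trajectory.

    Returns the final SGD weights, the averaged weights, and the number of
    points that went into the average.
    """
    rng = random.Random(seed)
    w = list(w0)
    w_swa = [0.0] * len(w)  # running average of the collected weights
    n_avg = 0               # how many points have been averaged so far

    for step in range(1, steps + 1):
        g = grad_fn(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]  # plain SGD update

        # Collect a point every `swa_every` steps once `swa_start` is reached.
        if step >= swa_start and (step - swa_start) % swa_every == 0:
            # Incremental mean: new_avg = (old_avg * n + w) / (n + 1)
            w_swa = [(a * n_avg + wi) / (n_avg + 1) for a, wi in zip(w_swa, w)]
            n_avg += 1

    return w, w_swa, n_avg


if __name__ == "__main__":
    w_sgd, w_avg, n = sgd_with_swa(noisy_quadratic_grad, [3.0, -3.0])
    print("last SGD iterate:", w_sgd)
    print("averaged weights:", w_avg, "from", n, "points")
```

Because only a running mean is stored, the sketch needs one extra copy of the weights regardless of how many points are averaged, which is in line with the low overhead the abstract claims. A cyclical schedule could be substituted by making `lr` a function of `step` and sampling at the end of each cycle.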

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm
cs.LG 2026-05 conditional novelty 7.0

A framework to identify and convert foldable layer normalizations to RMSNorm for exact equivalence and faster inference in deep neural networks.
A foundation model of vision, audition, and language for in-silico neuroscience
q-bio.NC 2026-05 unverdicted novelty 7.0

TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
Differentially Private Model Merging
cs.LG 2026-04 unverdicted novelty 7.0

Post-processing via random selection or linear combination generates differentially private models for arbitrary privacy parameters from pre-trained models on the same dataset.
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
cs.CR 2026-04 unverdicted novelty 7.0

Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
quant-ph 2026-04 unverdicted novelty 7.0

Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.
Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors
stat.ML 2026-05 unverdicted novelty 6.0

The Spatial Adapter equips frozen predictors with a spatially regularized orthonormal basis for residuals and derives a closed-form low-rank-plus-noise covariance for spatial prediction and kriging.
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection
cs.LG 2026-05 unverdicted novelty 6.0

TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...
MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text
cs.CL 2026-05 unverdicted novelty 6.0

MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive...
CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization
cs.CV 2026-05 unverdicted novelty 6.0

CPCANet deep-unfolds Common PCA to learn domain-invariant subspaces, achieving state-of-the-art zero-shot domain generalization on standard benchmarks.
Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy
cs.LG 2026-05 unverdicted novelty 6.0

Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.
Using Graph Neural Networks for hadronic clustering and to reduce beam background in the Belle II electromagnetic calorimeter
hep-ex 2026-04 unverdicted novelty 6.0

Graph neural networks can identify and remove unwanted beam background depositions in the Belle II calorimeter to improve hadronic clustering and reduce fake photon clusters.
FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods
cs.CV 2026-04 conditional novelty 6.0

The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower ...
Generalization at the Edge of Stability
cs.LG 2026-04 unverdicted novelty 6.0

Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...
Benchmarking Optimizers for MLPs in Tabular Deep Learning
cs.LG 2026-04 unverdicted novelty 6.0

Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.
AIRA_2: Overcoming Bottlenecks in AI Research Agents
cs.AI 2026-03 conditional novelty 6.0

AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20...
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Biologically-Grounded Multi-Encoder Architectures as Developability Oracles for Antibody Design
q-bio.BM 2026-04 unverdicted novelty 5.0

CrossAbSense oracles using frozen PLM encoders plus self- or cross-attention decoders improve prediction accuracy by 12-20% on three of five developability assays for therapeutic IgGs, with architecture choices reveal...
MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
cs.CV 2026-04 unverdicted novelty 5.0

MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselin...
The Platonic Representation Hypothesis
cs.LG 2024-05 unverdicted novelty 5.0

Representations learned by large AI models are converging toward a shared statistical model of reality.
Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
cs.LG 2026-05 unverdicted novelty 4.0

The paper proposes Trajectory Regularized Merging (TRM) to enable storage-free model merging in continual learning by optimizing in an augmented trajectory subspace with task alignment, prediction consistency, and gra...
Momentum-Anchored Multi-Scale Fusion Model for Long-Tailed Chest X-Ray Classification
cs.CV 2026-05 unverdicted novelty 4.0

A new neural network stabilizes features for rare chest X-ray diseases via momentum anchoring and multi-scale fusion on EfficientNet, achieving 0.8682 AUC on ChestX-ray14.
Phoenix-VL 1.5 Medium Technical Report
cs.CL 2026-05 unverdicted novelty 3.0

Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying comp...
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
cs.CL 2026-04 unverdicted novelty 3.0

LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.