Recognition: 2 Lean theorem links
Sharpness-Aware Minimization for Efficiently Improving Generalization
Pith reviewed 2026-05-16 20:10 UTC · model grok-4.3
The pith
Sharpness-Aware Minimization finds parameters in flat loss neighborhoods to improve generalization over standard training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAM seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently, leading to improved generalization across benchmark datasets and models.
What carries the argument
The min-max objective that minimizes the maximum loss value inside a neighborhood of fixed radius around the current parameters, approximated via a first-order Taylor expansion for efficient gradient computation.
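For concreteness, here is a minimal LaTeX sketch of that objective and its first-order approximation, in the standard SAM notation (w for parameters, L_S for training loss, rho for the neighborhood radius); it is a rendering consistent with the abstract's description rather than a formula quoted from this page.

    % SAM objective: minimize the worst-case loss in a rho-ball around w (weight decay omitted)
    \min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \, L_S(w + \epsilon)

    % A first-order Taylor expansion of the inner problem yields the closed-form approximate maximizer
    \hat{\epsilon}(w) = \rho \, \frac{\nabla_w L_S(w)}{\lVert \nabla_w L_S(w) \rVert_2},
    \qquad
    \nabla_w L_S^{\mathrm{SAM}}(w) \approx \left. \nabla_w L_S(w) \right|_{w + \hat{\epsilon}(w)}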
Load-bearing premise
That parameters whose neighborhoods have uniformly low loss will reliably generalize better than parameters found by minimizing training loss alone.
What would settle it
An experiment on a standard benchmark where SAM training produces worse test accuracy than standard gradient descent while using identical model size, data, and hyperparameter budgets.
read the original abstract
In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently. We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-10, CIFAR-100, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. We open source our code at \url{https://github.com/google-research/sam}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Sharpness-Aware Minimization (SAM), a min-max optimization procedure that seeks parameters lying in neighborhoods of uniformly low loss to simultaneously minimize training loss value and loss sharpness. The inner maximization over perturbations of size at most rho is approximated via a single gradient-ascent step, after which the outer minimization is performed with gradient descent. Empirical results on CIFAR-10, CIFAR-100, ImageNet, and finetuning tasks show consistent generalization improvements and new state-of-the-art performance for several models, plus robustness to label noise comparable to specialized methods.
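As an illustration of that two-step structure, here is a minimal Python/JAX sketch of a single SAM update on a generic differentiable loss; the names (loss_fn, params, batch, rho, lr) are placeholders for this example and are not identifiers from the paper's released code.

    import jax
    import jax.numpy as jnp

    def sam_update(params, batch, loss_fn, rho=0.05, lr=0.1):
        # Step 1 (inner max): one normalized gradient-ascent step of size rho.
        grads = jax.grad(loss_fn)(params, batch)
        grad_norm = jnp.sqrt(sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads)))
        eps = jax.tree_util.tree_map(lambda g: rho * g / (grad_norm + 1e-12), grads)
        # Step 2 (outer min): descend using the gradient taken at the perturbed point.
        perturbed = jax.tree_util.tree_map(lambda p, e: p + e, params, eps)
        sam_grads = jax.grad(loss_fn)(perturbed, batch)
        return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, sam_grads)

    # Toy usage on a quadratic loss (illustrative only).
    loss_fn = lambda params, batch: jnp.sum((params["w"] - batch) ** 2)
    new_params = sam_update({"w": jnp.ones(3)}, jnp.zeros(3), loss_fn)

Plain SGD would apply grads rather than sam_grads in the final line of sam_update; everything else in the training loop is unchanged, which is also why SAM roughly doubles the per-step gradient cost.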
Significance. If the reported gains are reproducible under standard controls, SAM offers a practical, geometry-motivated regularizer that improves generalization in overparameterized models without requiring architectural changes. The open-sourced implementation is a clear strength that enables direct verification and extension.
major comments (2)
- [§3] §3 (SAM formulation and Algorithm 1): the single gradient-ascent step used to approximate the inner maximization over the rho-ball is presented without error bounds or analysis of how closely it tracks the true worst-case loss in non-convex high-dimensional landscapes; because the central claim equates neighborhood flatness with generalization, this approximation error is load-bearing and requires either a supporting lemma or explicit empirical validation that the surrogate correlates with true sharpness.
- [§4] §4 (experimental protocol): the reported improvements lack details on the number of independent runs, standard deviations, or statistical significance tests; without these, it is impossible to assess whether the gains over baselines are robust or could be explained by the implicit regularization induced by the perturbation step itself rather than the intended geometric principle. (An illustrative sketch of such seed-level reporting follows this list.)
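For illustration only, a minimal sketch of seed-level reporting with a paired test, assuming per-seed test accuracies are available as arrays; the numbers below are placeholders rather than results from the paper, and the use of SciPy's ttest_rel is an assumption of this example.

    import numpy as np
    from scipy import stats

    # Hypothetical per-seed test accuracies (placeholders, not reported numbers).
    sam_acc = np.array([0.968, 0.970, 0.969, 0.971, 0.967])
    baseline_acc = np.array([0.963, 0.966, 0.964, 0.965, 0.962])

    print(f"SAM:      {sam_acc.mean():.4f} +/- {sam_acc.std(ddof=1):.4f}")
    print(f"Baseline: {baseline_acc.mean():.4f} +/- {baseline_acc.std(ddof=1):.4f}")

    # Paired t-test across matched seeds.
    t_stat, p_value = stats.ttest_rel(sam_acc, baseline_acc)
    print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")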
minor comments (2)
- [§3] Notation for the perturbation step size and projection is introduced without an explicit equation reference in the main text; adding a numbered display equation would improve clarity.
- [§4] The abstract states 'novel state-of-the-art performance for several' tasks; the main text should list the exact prior SOTA numbers and the precise margins achieved for each.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
- Referee: [§3] §3 (SAM formulation and Algorithm 1): the single gradient-ascent step used to approximate the inner maximization over the rho-ball is presented without error bounds or analysis of how closely it tracks the true worst-case loss in non-convex high-dimensional landscapes; because the central claim equates neighborhood flatness with generalization, this approximation error is load-bearing and requires either a supporting lemma or explicit empirical validation that the surrogate correlates with true sharpness.
Authors: We acknowledge that the single-step gradient ascent approximation to the inner maximization lacks formal error bounds, which would be difficult to derive in the non-convex high-dimensional setting. This choice prioritizes computational efficiency, consistent with one-step approximations commonly used in related min-max problems such as adversarial training. To address the concern, we will add explicit empirical validation in the revised manuscript: we will compare the single-step surrogate sharpness to multi-step (e.g., 5-10 step) approximations on smaller models and subsets of CIFAR-10, checking whether the surrogate correlates with the true worst-case loss within the rho-ball and whether it aligns with observed generalization gains (a minimal sketch of such a multi-step check appears after these responses). revision: partial
- Referee: [§4] §4 (experimental protocol): the reported improvements lack details on the number of independent runs, standard deviations, or statistical significance tests; without these, it is impossible to assess whether the gains over baselines are robust or could be explained by the implicit regularization induced by the perturbation step itself rather than the intended geometric principle.
Authors: We agree that additional statistical details are necessary for assessing robustness. In the revised version, we will report all main results as averages over at least 3 independent random seeds, including standard deviations. We will also add paired statistical significance tests (e.g., t-tests) against baselines to confirm the improvements. On the question of implicit regularization from the perturbation step, our existing ablations (random perturbation baselines and varying rho) already help isolate the geometric effect; we will expand this discussion and add further controls in the revision to more directly address this alternative explanation. revision: yes
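As a companion to the first response, here is a minimal Python/JAX sketch of the proposed multi-step check, with placeholder names (loss_fn, params, batch, rho, n_steps, step_size are assumptions of this illustration): it approximates the worst-case loss in the rho-ball by projected gradient ascent, so the result can be compared against the single-step surrogate.

    import jax
    import jax.numpy as jnp

    def global_norm(tree):
        return jnp.sqrt(sum(jnp.sum(x ** 2) for x in jax.tree_util.tree_leaves(tree)))

    def worst_case_loss(params, batch, loss_fn, rho=0.05, n_steps=10, step_size=0.01):
        # Approximate max over ||eps||_2 <= rho of loss(params + eps) by projected ascent.
        eps = jax.tree_util.tree_map(jnp.zeros_like, params)
        for _ in range(n_steps):
            perturbed = jax.tree_util.tree_map(lambda p, e: p + e, params, eps)
            g = jax.grad(loss_fn)(perturbed, batch)
            g_norm = global_norm(g)
            eps = jax.tree_util.tree_map(lambda e, gi: e + step_size * gi / (g_norm + 1e-12), eps, g)
            # Project eps back onto the rho-ball around the unperturbed parameters.
            scale = jnp.minimum(1.0, rho / (global_norm(eps) + 1e-12))
            eps = jax.tree_util.tree_map(lambda e: e * scale, eps)
        return loss_fn(jax.tree_util.tree_map(lambda p, e: p + e, params, eps), batch)

Comparing this quantity (with n_steps well above 1) against the single-step surrogate across checkpoints would make the correlation claim in the first response concrete.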
Circularity Check
SAM min-max objective directly encodes neighborhood-loss goal by definition; no reduction to fitted inputs or self-citation chains
full rationale
The paper's core formulation defines SAM explicitly as the procedure that minimizes the worst-case loss inside a rho-neighborhood, yielding the min-max problem without any intermediate derivation that collapses back to a fitted parameter or prior result by construction. Empirical gains on CIFAR/ImageNet are reported as separate validation rather than forced by the equations themselves. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain for the central claim. The single-step inner-max approximation is an efficiency choice, not a circular step. This yields a low but non-zero score reflecting the definitional nature of the objective while preserving independent empirical content.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: parameters lying in neighborhoods of uniformly low loss generalize better than those at sharp minima.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced · echoes
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "SAM seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem"
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel · unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "Motivated by prior work connecting the geometry of the loss landscape and generalization"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Estimating Implicit Regularization in Deep Learning
  Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
- iGENE: A Differentiable Flux-Tube Gyrokinetic Code in TensorFlow
  A fully differentiable TensorFlow gyrokinetic code allows approximate gradients of nonlinear turbulence quantities to be used for outer-loop tasks such as profile prediction despite stochasticity.
- When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
  FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
- Hierarchical Text-Conditional Image Generation with CLIP Latents
  A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
- TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection
  TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...
- Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions
  A diffusion-based synthetic data pipeline using inpainting and OOD post-selection improves long-tail skin lesion classification on ISIC2019, delivering over 28% accuracy gain on the rarest class.
- Geometric and Spectral Alignment for Deep Neural Network II
  The work establishes margin-verified certificates for physical alignment of residual Jacobian chains by bounding truncation errors and decomposing the Physical Alignment Matrix orthogonally under fitted effective-rank...
- Generalization at the Edge of Stability
  Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
  Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
- LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection
  LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and p...
- Robust Policy Optimization to Prevent Catastrophic Forgetting
  FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
- MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization
  MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC be...
- Secure and Privacy-Preserving Vertical Federated Learning
  Three optimized MPC protocols for privacy-preserving vertical federated learning that support global and global-local updates while reducing computation versus naive full-MPC delegation.
- A Faster Path to Continual Learning
  C-Flat Turbo accelerates continual learning by skipping redundant flatness gradients via direction-invariance observations and linear adaptive scheduling, delivering 1-1.25x speedup with comparable accuracy.
- Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks
  A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.
- MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications
  MOMO merges sensor-specific models from three Mars orbital instruments at matched validation loss stages to form a foundation model that outperforms ImageNet, Earth observation, sensor-specific, and supervised baselin...
- FedNSAM: Consistency of Local and Global Flatness for Federated Learning
  FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.