Flat Minima

Sepp Hochreiter, Jürgen Schmidhuber · 1997 · Neural Computation · DOI 10.1162/neco.1997.9.1.1 · arXiv gov/9117894

10 Pith papers cite this work, alongside 479 external citations (source: Crossref). Polarity classification is still being indexed.


representative citing papers

Are Flat Minima an Illusion?

cs.LG · 2026-03-24 · unverdicted · novelty 8.0

Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

On the Generalization of Knowledge Distillation: An Information-Theoretic View

cs.IT · 2026-05-13 · unverdicted · novelty 7.0

Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.

Estimating Implicit Regularization in Deep Learning

stat.ML · 2026-05-06 · unverdicted · novelty 7.0

Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.

When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

cs.LG · 2026-04-16 · conditional · novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

For losses with product-stable minima, gradient descent on l(xy) converges provably at the edge of stability, with bifurcation diagrams characterizing the resulting stable oscillations and sharpness.

Feature Starvation as Geometric Instability in Sparse Autoencoders

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

The Role of Symmetry in Optimizing Overparameterized Networks

cs.LG · 2026-04-28 · unverdicted · novelty 6.0 · 2 refs

Overparameterization adds symmetries that precondition the Hessian for better minima and increase the probability mass of global minima near typical initializations.

From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

cs.LG · 2026-05-01 · unverdicted · novelty 5.0

EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.

Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks

cs.LG · 2026-04-11 · unverdicted · novelty 5.0

A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.

citing papers explorer

Showing 10 of 10 citing papers.

Are Flat Minima an Illusion? · cs.LG · 2026-03-24 · unverdicted · none · ref 5
Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
On the Generalization of Knowledge Distillation: An Information-Theoretic View · cs.IT · 2026-05-13 · unverdicted · none · ref 14
Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
Estimating Implicit Regularization in Deep Learning · stat.ML · 2026-05-06 · unverdicted · none · ref 17
Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence · cs.LG · 2026-04-16 · conditional · none · ref 8
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability · cs.LG · 2026-04-03 · unverdicted · none · ref 1
For losses with product-stable minima, gradient descent on l(xy) converges provably at the edge of stability, with bifurcation diagrams characterizing the resulting stable oscillations and sharpness.
Feature Starvation as Geometric Instability in Sparse Autoencoders · cs.LG · 2026-05-06 · unverdicted · none · ref 19
Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting · cs.LG · 2026-05-04 · unverdicted · none · ref 52
Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
The Role of Symmetry in Optimizing Overparameterized Networks · cs.LG · 2026-04-28 · unverdicted · none · ref 20 · 2 links
Overparameterization adds symmetries that precondition the Hessian for better minima and increase the probability mass of global minima near typical initializations.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity · cs.LG · 2026-05-01 · unverdicted · none · ref 2
EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks · cs.LG · 2026-04-11 · unverdicted · none · ref 6
A closed-form upper bound on the maximum Hessian eigenvalue of cross-entropy loss is derived for smooth nonlinear neural networks.
