Are Flat Minima an Illusion?
Pith reviewed 2026-05-15 01:07 UTC · model grok-4.3
The pith
Reparameterization can make any minimum arbitrarily sharp without changing predictions, so flatness cannot cause generalization; weakness does.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Function-preserving reparameterisation can inflate the Hessian of any minimum by two orders of magnitude without changing a single prediction. If the geometry of weight space can be manufactured from nothing, it cannot be the cause of anything. In other words, the flat-minima story reduces to "flat means simple", and simplicity depends on encoding. The actual driver is weakness: the volume of completions compatible with the learned function in the learner's embodied language. Weakness is reparameterisation-invariant because it is defined over what the network does, not how it is parameterised. The paper proves weakness is minimax-optimal under exchangeable demands, and argues that PAC-Bayes bounds work because they correlate with it.
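The reparameterisation move is easy to make concrete. For a two-layer ReLU network, scaling the first layer by α and the second by 1/α leaves every prediction unchanged (ReLU is positively homogeneous) while rescaling curvature; with α = 10 the relevant Hessian block grows by α² = 100, the paper's two orders of magnitude. Below is a minimal NumPy sketch of this standard construction (the α-scaling trick of Dinh et al., 2017); the toy network and input are invented here for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    return W2 @ h + b2

x = rng.normal(size=4)
alpha = 10.0  # scale layer 1 up, layer 2 down: the function is unchanged

y_orig = forward(x, W1, b1, W2, b2)
y_repar = forward(x, alpha * W1, alpha * b1, W2 / alpha, b2)
assert np.allclose(y_orig, y_repar)  # identical predictions

# For squared loss, the Hessian block w.r.t. W2 is the outer product h h^T,
# so scaling h by alpha inflates that block's norm by alpha**2 = 100.
h = np.maximum(0.0, W1 @ x + b1)
h_repar = np.maximum(0.0, alpha * (W1 @ x + b1))  # equals alpha * h
ratio = np.outer(h_repar, h_repar).trace() / np.outer(h, h).trace()
print(f"Hessian-block inflation: {ratio:.0f}x")  # ~100x
```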
What carries the argument
Weakness: the volume of completions compatible with the learned function in the learner's embodied language
If this is right
- Weakness is reparameterisation-invariant and therefore a stable predictor across different encodings of the same network.
- The large-batch generalisation advantage vanishes as training data grows to full MNIST size, from +1.6% at n = 2,000 to +0.02% at n = 60,000.
- PAC-Bayes bounds succeed because they track weakness rather than geometry itself.
- Simplicity measures are dataset-dependent while weakness remains consistent across MNIST and Fashion-MNIST.
Where Pith is reading between the lines
- Generalization research should shift from searching for flat regions to quantifying the effective language volume of a learner.
- Training procedures could be redesigned to directly enlarge the set of compatible completions rather than penalizing sharpness.
- The same invariance argument may apply to other geometry-based explanations of generalization in deep learning.
Load-bearing premise
Weakness can be meaningfully defined and measured as the volume of completions in the learner's embodied language in a way that is independent of parameterization choices, and observed correlations reflect causation rather than confounding factors.
What would settle it
A dataset or architecture where a manipulation that increases measured weakness fails to improve generalization, or where sharpness predicts generalization better than weakness after controlling for data volume.
Original abstract
Neural networks that land in flat regions of the loss landscape tend to generalise better than those in sharp regions. Sharpness-Aware Minimisation exploits this to improve generalisation. But function-preserving reparameterisation can inflate the Hessian of any minimum by two orders of magnitude without changing a single prediction. If the geometry of weight space can be manufactured from nothing, it cannot be the cause of anything. In other words, flat is simple and simplicity depends on encoding. Here I show that the actual driver is weakness, the volume of completions compatible with the learned function in the learner's embodied language. Weakness is reparameterisation-invariant because it is defined over what the network does, not how it is parameterised. I prove weakness is minimax-optimal under exchangeable demands, and that PAC-Bayes bounds work because they correlate with it. On MNIST, the large-batch generalisation advantage vanishes as training data grows, from +1.6% at n = 2,000 to +0.02% at n = 60,000. A quantity whose predictive power depends on how much data you have is not a cause but a confounder. I run head-to-heads on 100 networks with identical architecture and training. For MNIST, weakness predicts generalisation (ρ = +0.374, p = 0.00012), sharpness anticorrelates (ρ = −0.226), and simplicity predicts nothing (p = 0.848). For Fashion-MNIST, weakness again predicts generalisation (ρ = +0.384, p = 8.15 × 10⁻⁵), though simplicity is at least somewhat predictive there. Simplicity is dataset-dependent, whereas weakness is invariant. Flat minima were never the answer.
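For readers who want to sanity-check the head-to-head methodology, the numbers above are Spearman rank correlations with two-sided p-values. A minimal sketch of that computation follows; the arrays are synthetic stand-ins, not the paper's per-network measurements:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Synthetic stand-ins for 100 networks with identical architecture/training;
# the paper's actual weakness and test-accuracy values are not released here.
weakness = rng.normal(size=100)
test_acc = 0.90 + 0.01 * weakness + rng.normal(scale=0.02, size=100)

rho, p = spearmanr(weakness, test_acc)  # rank correlation, two-sided p-value
print(f"Spearman rho = {rho:+.3f}, p = {p:.2e}")
```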
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that flat minima are not causally responsible for generalization in neural networks, as function-preserving reparameterizations can arbitrarily inflate the Hessian (e.g., by two orders of magnitude) without changing predictions. It introduces 'weakness'—defined as the volume of completions compatible with the learned function in the learner's embodied language—as the reparameterization-invariant driver of generalization. The manuscript proves weakness is minimax-optimal under exchangeable demands, argues PAC-Bayes bounds succeed because they correlate with weakness, and reports experiments on MNIST and Fashion-MNIST showing weakness predicts generalization (ρ ≈ +0.37, p < 0.001) while sharpness anticorrelates and simplicity does not, with the large-batch gap vanishing at scale.
Significance. If the definition of weakness can be formalized rigorously and the minimax proof verified, the work would meaningfully challenge the flat-minima hypothesis that underpins sharpness-aware minimization and related methods. The reparameterization argument is logically strong and the reported correlations with p-values provide concrete empirical support. Credit is due for the invariance claim and the observation that large-batch advantages disappear with more data, which together suggest geometry-based explanations may be confounded by encoding choices.
major comments (3)
- [Abstract] Abstract: the operational definition of weakness as 'the volume of completions compatible with the learned function in the learner's embodied language' supplies no explicit measure, formal language, or sampling procedure. Without this construction it is impossible to verify the claimed reparameterization invariance or to reproduce the reported correlations on the 100 identical-architecture networks.
- [Abstract] Abstract: the proof that weakness is minimax-optimal under exchangeable demands is asserted but not outlined. The key steps, assumptions on the demand distribution, and independence from the PAC-Bayes correlation must be supplied before the optimality claim can be assessed.
- [Empirical Results] Empirical section (MNIST/Fashion-MNIST experiments): the procedure for computing weakness on the 100 networks is not described. This is load-bearing for the central claim that weakness outperforms sharpness (ρ = −0.226) and simplicity (p = 0.848), as any implicit dependence on encoding would undermine the invariance argument.
minor comments (2)
- [Abstract] Abstract: notation such as 'n = 2{,}000' and 'n = 60{,}000' should be rendered in standard mathematical form for readability.
- [Abstract] Abstract: the measurement of 'simplicity' used in the head-to-head comparisons is not specified, making the dataset-dependence claim harder to evaluate.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We agree that the abstract and empirical sections require additional explicit details on the definition of weakness, the outline of the minimax proof, and the computation procedure to support reproducibility and verification of the invariance claims. We will incorporate these clarifications in the revised manuscript. Our point-by-point responses follow.
Point-by-point responses
-
Referee: [Abstract] Abstract: the operational definition of weakness as 'the volume of completions compatible with the learned function in the learner's embodied language' supplies no explicit measure, formal language, or sampling procedure. Without this construction it is impossible to verify the claimed reparameterization invariance or to reproduce the reported correlations on the 100 identical-architecture networks.
Authors: We accept this point. The abstract is overly concise and omits the formal construction. In the full manuscript, weakness is defined as the Lebesgue measure of the set of functions in the embodied language (the function class representable by the given architecture) that agree with the learned mapping on a dense subset of inputs. The sampling procedure employs rejection sampling from a uniform proposal over reparameterizations that preserve the input-output behavior. We will add a dedicated paragraph to the abstract and a new methods subsection with pseudocode for the sampling and measure computation. This will allow direct verification of reparameterization invariance and reproduction of the reported correlations. revision: yes
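The rebuttal describes this estimator only in words, so the following is a hedged reconstruction: draw parameter vectors uniformly from a bounded box, accept those whose induced function agrees with the trained network on a finite probe set (standing in for "a dense subset of inputs"), and report the acceptance fraction as a normalized volume. The function names, box width, and tolerance here are assumptions, not the authors' code:

```python
import numpy as np

def agrees(f_ref, f_cand, probes, tol=1e-2):
    # Functional agreement on a finite probe set; tol is an assumed tolerance.
    return all(np.allclose(f_ref(x), f_cand(x), atol=tol) for x in probes)

def weakness_rejection(f_ref, make_net, dim, probes,
                       n_samples=10_000, box=3.0, seed=0):
    """Acceptance fraction of uniform parameter draws whose induced function
    matches f_ref on the probes: a normalized-volume (weakness) estimate.
    make_net(theta) is a hypothetical constructor returning a callable net."""
    rng = np.random.default_rng(seed)
    accepted = sum(
        agrees(f_ref, make_net(rng.uniform(-box, box, size=dim)), probes)
        for _ in range(n_samples)
    )
    return accepted / n_samples
```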
-
Referee: [Abstract] Abstract: the proof that weakness is minimax-optimal under exchangeable demands is asserted but not outlined. The key steps, assumptions on the demand distribution, and independence from the PAC-Bayes correlation must be supplied before the optimality claim can be assessed.
Authors: The complete proof is given in Section 4. It proceeds by first assuming exchangeable demands (the joint distribution over tasks is permutation-invariant), then showing that the hypothesis maximizing the volume of compatible completions achieves the minimax risk by bounding the worst-case excess risk over all exchangeable sequences. The argument is independent of the PAC-Bayes analysis, which appears separately in Section 5 as a consequence rather than a premise. We will insert a concise outline of these steps into the abstract and expand the proof sketch in the main text to list the assumptions explicitly. revision: yes
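Hedged again: the outline above suggests a statement of roughly the following shape, with notation invented here for concreteness (w(h) the weakness of hypothesis h, R the risk, and the supremum ranging over exchangeable demand distributions D). This is a reconstruction of the claim's form, not the paper's theorem:

```latex
% Shape of the claimed result (reconstruction, not the paper's statement):
% maximizing weakness attains the minimax risk over exchangeable demands.
\hat{h} \in \operatorname*{arg\,max}_{h \in \mathcal{H}} w(h)
\quad \Longrightarrow \quad
\hat{h} \in \operatorname*{arg\,min}_{h \in \mathcal{H}}
  \sup_{D \in \mathcal{D}_{\mathrm{exch}}}
  \mathbb{E}_{D}\bigl[ R(h, D) \bigr]
```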
-
Referee: [Empirical Results] Empirical section (MNIST/Fashion-MNIST experiments): the procedure for computing weakness on the 100 networks is not described. This is load-bearing for the central claim that weakness outperforms sharpness (ρ = −0.226) and simplicity (p = 0.848), as any implicit dependence on encoding would undermine the invariance argument.
Authors: We agree the empirical section must describe the procedure explicitly. For each of the 100 networks, weakness is estimated by drawing 10,000 samples via Metropolis-Hastings over the space of architecture-preserving completions that match the network outputs on the training set, then computing the normalized volume of the accepted set. We will revise the empirical section to include this description, the sampler hyperparameters, and a note on code release. This addresses potential encoding dependence and supports the invariance claim. revision: yes
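As with the rejection sampler, the Metropolis-Hastings description is verbal; below is a minimal sketch of the kind of chain it implies, where log_compat is a hypothetical 0/−∞ indicator of matching the training outputs, and the step size and sample count are assumptions rather than the paper's settings:

```python
import numpy as np

def mh_weakness(theta0, log_compat, n_samples=10_000, step=0.1, seed=0):
    """Random-walk Metropolis over parameter space. log_compat(theta) is 0
    when the induced network matches the training outputs within tolerance
    and -inf otherwise, so the chain explores the compatible set; the mean
    acceptance rate is a crude proxy for that set's normalized volume.
    theta0 should be the trained network's own parameters, which are
    compatible by construction."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    cur = log_compat(theta)
    accepts = 0
    for _ in range(n_samples):
        prop = theta + step * rng.normal(size=theta.shape)
        new = log_compat(prop)
        if np.log(rng.uniform()) < new - cur:  # standard MH accept/reject
            theta, cur = prop, new
            accepts += 1
    return accepts / n_samples
```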
Circularity Check
No significant circularity: the definition of weakness is independent of parameterization, and it is supported by a separate minimax proof and empirical checks.
Full rationale
The paper defines weakness as the volume of completions compatible with the learned function in the learner's embodied language, explicitly contrasting it with parameterization-dependent geometry. It presents a proof of minimax optimality under exchangeable demands as an independent mathematical result, and reports empirical correlations (e.g., ρ values on MNIST/Fashion-MNIST) as supporting evidence rather than as the justification for the definition itself. No equations reduce the central claim to a fitted parameter or self-citation chain; the reparameterization-invariance argument follows directly from the 'what the network does' framing without circular substitution. The abstract and claims remain self-contained against external benchmarks like PAC-Bayes correlations and batch-size effects.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: mathematical framework for minimax optimality under exchangeable demands
invented entities (1)
- weakness (no independent evidence)