Recognition: unknown
Too Sharp, Too Sure: When Calibration Follows Curvature
Pith reviewed 2026-05-10 00:20 UTC · model grok-4.3
The pith
Calibration error in neural networks tracks loss curvature because both are controlled by the same margin-dependent exponential tails along the training path.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both ECE and Gauss-Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. A margin-aware training objective that targets robust-margin tails and local smoothness yields improved out-of-sample calibration across optimizers without sacrificing accuracy.
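The abstract characterizes the margin-aware objective only as targeting robust-margin tails and local smoothness. A minimal sketch of what such an objective could look like follows; the function name margin_aware_loss, the weights tail_weight and smooth_weight, and the input-gradient smoothness proxy are all assumptions of this illustration, not the paper's method.

```python
import torch
import torch.nn.functional as F

def margin_aware_loss(model, x, y, tail_weight=0.1, smooth_weight=0.01):
    """Hypothetical margin-aware objective: cross-entropy plus an exponential
    penalty on small margins and a gradient-norm proxy for local smoothness.
    Illustrative sketch only; not the paper's objective."""
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)                                    # (batch, num_classes)
    ce = F.cross_entropy(logits, y)

    # Multiclass margin: true-class logit minus the best competing logit.
    true_class = F.one_hot(y, num_classes=logits.size(1)).bool()
    other_max = logits.masked_fill(true_class, float("-inf")).max(dim=1).values
    margin = logits.gather(1, y.unsqueeze(1)).squeeze(1) - other_max

    # Exponential tail penalty: dominated by the smallest (worst-case) margins.
    tail = torch.exp(-margin).mean()

    # One possible local-smoothness proxy: squared input-gradient norm of the CE loss.
    grad_x = torch.autograd.grad(ce, x, create_graph=True)[0]
    smooth = grad_x.flatten(1).pow(2).sum(dim=1).mean()

    return ce + tail_weight * tail + smooth_weight * smooth
```

The penalty weights here are arbitrary and would need tuning per task; if the paper's notion of local smoothness is parameter-space curvature, a SAM-style perturbation penalty could stand in for the input-gradient proxy.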
What carries the argument
The margin-dependent exponential tail functional, which bounds both calibration error and curvature throughout optimization.
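The review does not reproduce the functional's exact definition. As a hedged illustration of what such a functional and the twin bounds could look like (the symbols T, m_i(θ), C_cal, C_curv, and the choice of the largest Gauss-Newton eigenvalue as the curvature measure are assumptions of this sketch, not the paper's notation):

```latex
% Illustrative only: one natural form such a functional could take
% (the paper's exact definition and constants are not reproduced here).
% m_i(\theta): classification margin of example i at parameters \theta.
\[
T(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\exp\!\big(-m_i(\theta)\big)
\]
% The claimed shared control along the trajectory (\theta_t) would then read,
% for problem-specific constants C_{\mathrm{cal}}, C_{\mathrm{curv}} > 0:
\[
\mathrm{ECE}(\theta_t)\;\le\; C_{\mathrm{cal}}\,T(\theta_t),
\qquad
\lambda_{\max}\!\big(G(\theta_t)\big)\;\le\; C_{\mathrm{curv}}\,T(\theta_t),
\]
% with G(\theta) the Gauss-Newton approximation to the loss Hessian.
```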
If this is right
- Calibration can be improved at training time by intervening on margins rather than through post-hoc adjustments.
- The connection between sharpness and miscalibration is mediated by the statistics of the margin distribution.
- The same training change improves calibration under multiple gradient-based optimizers.
- Accuracy and calibration need not trade off when the objective explicitly encourages better margin tails.
Where Pith is reading between the lines
- The same margin-tail mechanism may link calibration to generalization phenomena that also depend on margins.
- If the coupling holds, similar objectives could be tested on non-vision tasks to check whether optimizer independence extends further.
- The result suggests examining whether altering margins independently of curvature can decouple the two quantities.
Load-bearing premise
The coupling between margins, curvature, and calibration is causal, so explicitly targeting margin tails during training will produce better calibration in a general way.
What would settle it
Training with the margin-aware objective under a new optimizer or on a new vision dataset and observing no reduction in ECE relative to standard training, while test accuracy stays comparable.
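Running that check requires a concrete ECE estimator. A minimal sketch of the standard binned estimator follows; the bin count and the use of max-softmax confidence are conventional choices, not necessarily the paper's protocol.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: occupancy-weighted average of |accuracy - confidence|
    over equal-width confidence bins.
    probs: (n, num_classes) predicted probabilities; labels: (n,) integer labels."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```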
original abstract
Modern neural networks can achieve high accuracy while remaining poorly calibrated, producing confidence estimates that do not match empirical correctness. Yet calibration is often treated as a post-hoc attribute. We take a different perspective: we study calibration as a training-time phenomenon on small vision tasks, and ask whether calibrated solutions can be obtained reliably by intervening on the training procedure. We identify a tight coupling between calibration, curvature, and margins during training of deep networks under multiple gradient-based methods. Empirically, Expected Calibration Error (ECE) closely tracks curvature-based sharpness throughout optimization. Mathematically, we show that both ECE and Gauss-Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. Guided by this mechanism, we introduce a margin-aware training objective that explicitly targets robust-margin tails and local smoothness, yielding improved out-of-sample calibration across optimizers without sacrificing accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies calibration as a training-time phenomenon in deep networks on small vision tasks. It reports that Expected Calibration Error (ECE) empirically tracks Gauss-Newton curvature across multiple gradient-based optimizers. It derives that both ECE and curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the optimization trajectory. Guided by this, the authors propose a margin-aware training objective targeting robust margins and local smoothness, which empirically improves out-of-sample calibration without accuracy loss.
Significance. If the shared-control derivation is rigorous and the constants remain stable, the work offers a mechanistic link between optimization geometry, margins, and calibration that could inform training procedures for better-calibrated models. The empirical consistency across optimizers and the introduction of a targeted objective are strengths; reproducible code or machine-checked elements would further strengthen it, but none are mentioned.
major comments (2)
- [Mathematical derivation (abstract and §3)] The central claim that both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional (up to problem-specific constants) is load-bearing for the subsequent proposal of the margin-aware objective. The derivation must explicitly show that these constants remain bounded and independent of depth, width, iteration count, and curvature spikes; otherwise the coupling does not tightly imply the observed tracking or the causal benefit of the new objective. This needs to be verified in the mathematical section with the precise functional form and any boundedness assumptions stated.
- [Empirical results section] The empirical claim that ECE closely tracks curvature throughout optimization relies on the tail term dominating even in early high-curvature regimes where margins may shrink. The manuscript should include an ablation or sensitivity analysis showing that the coupling persists when the exponential tail is not the leading term, or clarify the conditions under which the approximation holds.
minor comments (2)
- [Notation and preliminaries] Notation for the margin-dependent exponential tail functional should be defined once with all dependencies (e.g., on the loss, Hessian approximation) made explicit to avoid ambiguity when comparing to standard ECE and curvature definitions.
- [Experiments] The manuscript would benefit from a table summarizing the problem-specific constants across the reported tasks and optimizers to illustrate their stability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our results. We respond to each major comment below and indicate the revisions we will make to the manuscript.
point-by-point responses
Referee: [Mathematical derivation (abstract and §3)] The central claim that both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional (up to problem-specific constants) is load-bearing for the subsequent proposal of the margin-aware objective. The derivation must explicitly show that these constants remain bounded and independent of depth, width, iteration count, and curvature spikes; otherwise the coupling does not tightly imply the observed tracking or the causal benefit of the new objective. This needs to be verified in the mathematical section with the precise functional form and any boundedness assumptions stated.
Authors: We agree that making the boundedness explicit strengthens the mathematical foundation. In the revised version, we will expand §3 to include the precise functional form of the margin-dependent exponential tail and explicitly state the boundedness assumptions. Under the assumptions of bounded data norms, Lipschitz-continuous activations, and positive margins along the trajectory, the constants are shown to depend only on these problem-specific quantities and remain independent of network depth, width, iteration count, and transient curvature spikes. This is derived by bounding the tail integral using the margin lower bound and gradient norms. We believe this addresses the concern and supports the proposal of the margin-aware objective. revision: yes
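For intuition, the kind of boundedness step the authors appeal to can be written down directly. In this hedged sketch, m_min, B, and L are assumptions of the illustration rather than the paper's notation, and T is the candidate tail functional sketched earlier in this review.

```latex
% Illustrative boundedness step (not the paper's derivation): if every margin
% stays above m_{\min} > 0 along the trajectory, the tail functional is bounded,
\[
T(\theta_t) \;=\; \frac{1}{n}\sum_{i=1}^{n}\exp\!\big(-m_i(\theta_t)\big)
\;\le\; \exp(-m_{\min}),
\]
% and with data norms bounded by B and L-Lipschitz activations, the prefactors
% multiplying T(\theta_t) in the ECE and curvature bounds involve only B and L,
% not depth, width, or the iteration count.
```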
Referee: [Empirical results section] The empirical claim that ECE closely tracks curvature throughout optimization relies on the tail term dominating even in early high-curvature regimes where margins may shrink. The manuscript should include an ablation or sensitivity analysis showing that the coupling persists when the exponential tail is not the leading term, or clarify the conditions under which the approximation holds.
Authors: We thank the referee for pointing this out. While our empirical results show consistent tracking across optimizers, we will add a sensitivity analysis in the empirical section. This will include an ablation where we compute the relative contribution of the exponential tail term versus other factors at different optimization stages, particularly in early iterations. We will also clarify that the approximation holds primarily when margins are sufficiently positive and the tail dominates, which is the regime where calibration improves; in cases where margins shrink significantly, the coupling may be weaker, and we will discuss this limitation. revision: yes
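The ablation described here could be prototyped with a simple per-epoch logging routine. In this hedged sketch, the margin-tail term reuses the illustrative functional from above, and the curvature proxy is a Hutchinson estimate of the Hessian trace standing in for whatever Gauss-Newton measure the paper tracks; all names and the logging cadence are choices of the sketch, not the paper's protocol.

```python
import torch
import torch.nn.functional as F

def margin_tail(logits, y):
    """Illustrative exponential margin-tail term (same form as the sketch above)."""
    true_class = F.one_hot(y, num_classes=logits.size(1)).bool()
    other_max = logits.masked_fill(true_class, float("-inf")).max(dim=1).values
    margin = logits.gather(1, y.unsqueeze(1)).squeeze(1) - other_max
    return torch.exp(-margin).mean()

def hessian_trace(loss, params, n_samples=5):
    """Hutchinson estimate of tr(Hessian), used here as a cheap curvature proxy."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    est = 0.0
    for _ in range(n_samples):
        probes = [torch.randn_like(p).sign() for p in params]   # Rademacher-style probes
        hv = torch.autograd.grad(grads, params, grad_outputs=probes, retain_graph=True)
        est += sum((v * h).sum().item() for v, h in zip(probes, hv))
    return est / n_samples

def sensitivity_snapshot(model, x, y):
    """One per-epoch snapshot: how large is the tail term relative to the
    curvature proxy? Logged over training and compared against held-out ECE."""
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    params = [p for p in model.parameters() if p.requires_grad]
    return {
        "loss": loss.item(),
        "tail": margin_tail(logits, y).item(),
        "curvature_proxy": hessian_trace(loss, params),
    }
```

In the regime the rebuttal describes (margins sufficiently positive), the tail term should shrink together with the curvature proxy; logging both alongside held-out ECE makes the early, small-margin regime where the coupling may weaken directly visible.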
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical observation that ECE tracks curvature sharpness, followed by a mathematical derivation showing both quantities are controlled by the same margin-dependent exponential tail functional up to problem-specific constants. This bound is stated as a derived result along the optimization trajectory rather than a definitional equivalence or fitted input renamed as prediction. The margin-aware objective is introduced as guided by the identified mechanism but does not reduce to the bound by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatz smuggling are evident in the abstract or described chain. The derivation remains self-contained with independent content from standard ECE, Gauss-Newton curvature, and margin concepts.
Axiom & Free-Parameter Ledger
free parameters (1)
- problem-specific constants
axioms (1)
- domain assumption: Both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional
Reference graph
Works this paper leans on
- [1] Berta, E., Holzmüller, D., Jordan, M. I., and Bach, F. Rethinking early stopping: Refine, then calibrate. arXiv:2501.19195, 2025.
- [2] Cohen, J. M., Ghorbani, B., Krishnan, S., Agarwal, N., Medapati, S., Badura, M., Suo, D., Cardoze, D., Nado, Z., Dahl, G. E., and Gilmer, J. Adaptive gradient methods at the edge of stability. arXiv:2207.14484, 2022.
- [3] Jastrzębski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations.
- [4] Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD. arXiv:1711.04623, 2018.
- [5] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- [6] Lengyel, D., Jennings, N., Parpas, P., and Kantas, N. On flat minima, large margins and generalizability. OpenReview (ICLR 2021 submission), 2021.
- [7] Möllenhoff, T. and Khan, M. E. SAM as an optimal relaxation of Bayes. In International Conference on Learning Representations, 2023.
- [8] Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 625-632, 2005.
- [9] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. In ICLR 2017 Workshop Track Proceedings, 2017.
- [10] Stutz, D., Hein, M., and Schiele, B. Relating adversarially robust generalization to flat minima. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7807-7817, 2021.
- [11] Thulasidasan, S., Chennupati, G., Bilmes, J. A., Bhattacharya, T., and Michalak, S. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [12] Tsuzuku, Y., Sato, I., and Sugiyama, M. Normalized flat minima: Exploring scale-invariant definition of flat minima for neural networks using PAC-Bayesian analysis. In Proceedings of the 37th International Conference on Machine Learning.
- [13] Wu, J., Bartlett, P., Telgarsky, M., and Yu, B. Benefits of early stopping in gradient descent for overparameterized logistic regression. In Proceedings of the 42nd International Conference on Machine Learning.
- [14] Zheng, Y., Zhang, R., and Mao, Y. Regularizing neural networks via adversarial model perturbation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8152-8161, 2021.
- [15] Foret et al. (2021), Sharpness-Aware Minimization (SAM). Context in the paper: label smoothing (Müller et al., 2019) can likewise curb overconfidence on hard examples, and SAM, which biases optimization toward flatter minima, has been observed to lower calibration error (Zheng et al., 2021; Möllenhoff & Khan, 2023); the shared theme is that controlling the growth or fragility of margins tends to improve calibration.
- [16] Flat vs. sharp caveat for linear models. Context in the paper: flatness-seeking methods can improve adversarial robustness as a side effect (Stutz et al., 2021), but in linear models trained with cross-entropy on separable data the weight norm grows without bound as margins maximize, driving the Hessian to zero while the classifier becomes arbitrarily confident, so "flat vs. sharp" is less meaningful there.
discussion (0)