pith. machine review for the scientific record.

arxiv: 2604.20614 · v1 · submitted 2026-04-22 · 💻 cs.LG · math.DS · math.OC · stat.ML

Recognition: unknown

Too Sharp, Too Sure: When Calibration Follows Curvature

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 00:20 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC · stat.ML
keywords calibration · curvature · margins · expected calibration error · neural network training · sharpness · optimization dynamics

The pith

Calibration error in neural networks tracks loss curvature because both are controlled by the same margin-dependent exponential tails along the training path.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that expected calibration error and Gauss-Newton curvature during deep network training are both governed by a shared functional of the margin distribution tails. This means calibration emerges as a training-time property tied to how optimization shapes margins and local smoothness, not merely a post-training fix. Guided by this, the authors design a margin-aware objective that improves calibration on held-out data across multiple optimizers while preserving accuracy on small vision tasks.
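For a concrete reference point, Expected Calibration Error here is the standard binned gap between confidence and accuracy. A minimal sketch (the binning choices are ours, not the paper's):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: |accuracy - confidence| per bin, weighted by bin mass.

    `confidences` are the max predicted probabilities, `correct` is a 0/1
    array marking whether the top prediction was right. The 15-bin default
    is a common convention, not something fixed by the paper.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    n = len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
    return ece

# A perfectly calibrated toy predictor: confidence equals accuracy.
conf = np.array([0.9] * 10)
corr = np.array([1] * 9 + [0])  # 90% correct at 90% confidence
print(expected_calibration_error(conf, corr))  # → 0.0
```

A model that says 90% and is right 90% of the time scores zero; overconfidence (say, 100% confidence at 50% accuracy) scores the full gap.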

Core claim

Both ECE and Gauss-Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. A margin-aware training objective that targets robust-margin tails and local smoothness yields improved out-of-sample calibration across optimizers without sacrificing accuracy.
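For intuition on the curvature side, the Gauss-Newton trace has a closed form for a linear softmax classifier. This toy case is ours, not the paper's general setting, but it already shows the margin-tail behavior: small margins (near-uniform probabilities) keep the trace large, while large margins drive it toward zero.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gn_trace_linear_softmax(W, x):
    """Trace of the Gauss-Newton matrix of softmax cross-entropy
    with respect to W, for a linear model z = W @ x.

    For cross-entropy the Hessian in logit space is diag(p) - p p^T,
    so tr(GN) = ||x||^2 * (1 - sum_k p_k^2). This is only an
    illustrative special case of the curvature the paper bounds.
    """
    p = softmax(W @ x)
    return float(x @ x) * (1.0 - float(p @ p))

W = np.eye(2)
x = np.array([2.0, 0.0])
# Scaling W up raises confidence (larger margin) and lowers curvature.
print(gn_trace_linear_softmax(W, x) > gn_trace_linear_softmax(5 * W, x))  # → True
```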

What carries the argument

The margin-dependent exponential tail functional, which bounds both calibration error and curvature throughout optimization.
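One way to read the shared-control claim, in schematic form (our reconstruction from the abstract; the paper's exact functional and constants may differ):

```latex
% Schematic only: \gamma_i(\theta) is the margin of example i, G the
% Gauss-Newton matrix, c_1 and c_2 the problem-specific constants.
T(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i(\theta_t)},
\qquad
\mathrm{ECE}(\theta_t) \le c_1 \, T(\theta_t),
\qquad
\operatorname{tr} G(\theta_t) \le c_2 \, T(\theta_t).
```

Under this reading, both quantities shrink whenever the margin tail thins, which is what the reported tracking along the trajectory would reflect.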

If this is right

  • Calibration can be improved at training time by intervening on margins rather than through post-hoc adjustments.
  • The connection between sharpness and miscalibration is mediated by the statistics of the margin distribution.
  • The same training change improves calibration under multiple gradient-based optimizers.
  • Accuracy and calibration need not trade off when the objective explicitly encourages better margin tails.
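A minimal sketch of what a margin-aware objective of this kind could look like. This is illustrative only: the paper's actual objective, including its robust-margin and local-smoothness terms, is not reproduced here, and `lam` and `tau` are hypothetical knobs.

```python
import numpy as np

def margin_aware_loss(logits, labels, lam=0.1, tau=1.0):
    """Cross-entropy plus an exponential penalty on small margins.

    logits: (n, k) array; labels: (n,) int array.
    """
    n = len(labels)
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(n), labels].mean()

    # Multiclass margin: true-class logit minus best competing logit.
    true = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf
    margin = true - masked.max(axis=1)

    # Exponential tail penalty, dominated by the smallest margins --
    # the same tail statistic the shared bound is built on.
    tail = np.exp(-margin / tau).mean()
    return ce + lam * tail
```

A local-smoothness term (for example a gradient-norm or SAM-style perturbation penalty) would sit on top of this; it is omitted to keep the sketch minimal.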

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same margin-tail mechanism may link calibration to generalization phenomena that also depend on margins.
  • If the coupling holds, similar objectives could be tested on non-vision tasks to check whether optimizer independence extends further.
  • The result suggests examining whether altering margins independently of curvature can decouple the two quantities.

Load-bearing premise

The coupling between margins, curvature, and calibration is causal, so explicitly targeting margin tails during training will reliably improve calibration beyond the settings tested.

What would settle it

Training with the margin-aware objective on a new optimizer or vision dataset and observing no reduction in ECE relative to standard training, while test accuracy stays comparable.

Figures

Figures reproduced from arXiv: 2604.20614 by Alessandro Morosini, Matea Gjika, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1. Training dynamics for Gradient Descent and Stochastic Gradient Descent across learning rates on CIFAR-10. Expected Calibration Error closely tracks sharpness throughout training: both rise as the model enters the edge-of-stability regime, peak around the same time, and decay together as training progresses.
Figure 2. Training dynamics for SAM and Muon across learning rates on CIFAR-10.
Figure 3. Training dynamics for BulkSGD across different learning rates and numbers of projected-out gradients on CIFAR-10.
read the original abstract

Modern neural networks can achieve high accuracy while remaining poorly calibrated, producing confidence estimates that do not match empirical correctness. Yet calibration is often treated as a post-hoc attribute. We take a different perspective: we study calibration as a training-time phenomenon on small vision tasks, and ask whether calibrated solutions can be obtained reliably by intervening on the training procedure. We identify a tight coupling between calibration, curvature, and margins during training of deep networks under multiple gradient-based methods. Empirically, Expected Calibration Error (ECE) closely tracks curvature-based sharpness throughout optimization. Mathematically, we show that both ECE and Gauss--Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. Guided by this mechanism, we introduce a margin-aware training objective that explicitly targets robust-margin tails and local smoothness, yielding improved out-of-sample calibration across optimizers without sacrificing accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies calibration as a training-time phenomenon in deep networks on small vision tasks. It reports that Expected Calibration Error (ECE) empirically tracks Gauss-Newton curvature across multiple gradient-based optimizers. It derives that both ECE and curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the optimization trajectory. Guided by this, the authors propose a margin-aware training objective targeting robust margins and local smoothness, which empirically improves out-of-sample calibration without accuracy loss.

Significance. If the shared-control derivation is rigorous and the constants remain stable, the work offers a mechanistic link between optimization geometry, margins, and calibration that could inform training procedures for better-calibrated models. The empirical consistency across optimizers and the introduction of a targeted objective are strengths; reproducible code or machine-checked elements would further strengthen it, but none are mentioned.

major comments (2)
  1. [Mathematical derivation (abstract and §3)] The central claim that both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional (up to problem-specific constants) is load-bearing for the subsequent proposal of the margin-aware objective. The derivation must explicitly show that these constants remain bounded and independent of depth, width, iteration count, and curvature spikes; otherwise the coupling does not tightly imply the observed tracking or the causal benefit of the new objective. This needs to be verified in the mathematical section with the precise functional form and any boundedness assumptions stated.
  2. [Empirical results section] The empirical claim that ECE closely tracks curvature throughout optimization relies on the tail term dominating even in early high-curvature regimes where margins may shrink. The manuscript should include an ablation or sensitivity analysis showing that the coupling persists when the exponential tail is not the leading term, or clarify the conditions under which the approximation holds.
minor comments (2)
  1. [Notation and preliminaries] Notation for the margin-dependent exponential tail functional should be defined once with all dependencies (e.g., on the loss, Hessian approximation) made explicit to avoid ambiguity when comparing to standard ECE and curvature definitions.
  2. [Experiments] The manuscript would benefit from a table summarizing the problem-specific constants across the reported tasks and optimizers to illustrate their stability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We respond to each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Mathematical derivation (abstract and §3)] The central claim that both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional (up to problem-specific constants) is load-bearing for the subsequent proposal of the margin-aware objective. The derivation must explicitly show that these constants remain bounded and independent of depth, width, iteration count, and curvature spikes; otherwise the coupling does not tightly imply the observed tracking or the causal benefit of the new objective. This needs to be verified in the mathematical section with the precise functional form and any boundedness assumptions stated.

    Authors: We agree that making the boundedness explicit strengthens the mathematical foundation. In the revised version, we will expand §3 to include the precise functional form of the margin-dependent exponential tail and explicitly state the boundedness assumptions. Under the assumptions of bounded data norms, Lipschitz-continuous activations, and positive margins along the trajectory, the constants are shown to depend only on these problem-specific quantities and remain independent of network depth, width, iteration count, and transient curvature spikes. This is derived by bounding the tail integral using the margin lower bound and gradient norms. We believe this addresses the concern and supports the proposal of the margin-aware objective. revision: yes

  2. Referee: [Empirical results section] The empirical claim that ECE closely tracks curvature throughout optimization relies on the tail term dominating even in early high-curvature regimes where margins may shrink. The manuscript should include an ablation or sensitivity analysis showing that the coupling persists when the exponential tail is not the leading term, or clarify the conditions under which the approximation holds.

    Authors: We thank the referee for pointing this out. While our empirical results show consistent tracking across optimizers, we will add a sensitivity analysis in the empirical section. This will include an ablation where we compute the relative contribution of the exponential tail term versus other factors at different optimization stages, particularly in early iterations. We will also clarify that the approximation holds primarily when margins are sufficiently positive and the tail dominates, which is the regime where calibration improves; in cases where margins shrink significantly, the coupling may be weaker, and we will discuss this limitation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical observation that ECE tracks curvature sharpness, followed by a mathematical derivation showing both quantities are controlled by the same margin-dependent exponential tail functional up to problem-specific constants. This bound is stated as a derived result along the optimization trajectory rather than a definitional equivalence or fitted input renamed as prediction. The margin-aware objective is introduced as guided by the identified mechanism but does not reduce to the bound by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatz smuggling are evident in the abstract or described chain. The derivation remains self-contained with independent content from standard ECE, Gauss-Newton curvature, and margin concepts.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an unproven mathematical equivalence between ECE and curvature through a margin-tail functional, plus empirical observation on small tasks. No new physical entities are introduced; the constants are problem-specific and the new objective adds margin and smoothness terms.

free parameters (1)
  • problem-specific constants
    The mathematical control holds only up to these constants, which are not derived from first principles and must be treated as fitted or task-dependent.
axioms (1)
  • domain assumption Both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional
    This is the load-bearing mathematical claim stated in the abstract.

pith-pipeline@v0.9.0 · 5470 in / 1170 out tokens · 33708 ms · 2026-05-10T00:20:54.181140+00:00 · methodology

discussion (0)

