pith. machine review for the scientific record.

arxiv: 2604.20614 · v1 · submitted 2026-04-22 · 💻 cs.LG · math.DS · math.OC · stat.ML

Recognition: unknown

Too Sharp, Too Sure: When Calibration Follows Curvature

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 00:20 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC · stat.ML
keywords calibration · curvature · margins · expected calibration error · neural network training · sharpness · optimization dynamics

The pith

Calibration error in neural networks tracks loss curvature because both are controlled by the same margin-dependent exponential tails along the training path.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that expected calibration error and Gauss-Newton curvature during deep network training are both governed by a shared functional of the margin distribution tails. This means calibration emerges as a training-time property tied to how optimization shapes margins and local smoothness, not merely a post-training fix. Guided by this, the authors design a margin-aware objective that improves calibration on held-out data across multiple optimizers while preserving accuracy on small vision tasks.
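For a concrete reference point, Expected Calibration Error here is the standard binned gap between confidence and accuracy. A minimal sketch (the binning choices are ours, not the paper's):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: |accuracy - confidence| per bin, weighted by bin mass.

    `confidences` are the max predicted probabilities, `correct` is a 0/1
    array marking whether the top prediction was right. The 15-bin default
    is a common convention, not something fixed by the paper.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    n = len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
    return ece

# A perfectly calibrated toy predictor: confidence equals accuracy.
conf = np.array([0.9] * 10)
corr = np.array([1] * 9 + [0])  # 90% correct at 90% confidence
print(expected_calibration_error(conf, corr))  # → 0.0
```

A model that says 90% and is right 90% of the time scores zero; overconfidence (say, 100% confidence at 50% accuracy) scores the full gap.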

Core claim

Both ECE and Gauss-Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. A margin-aware training objective that targets robust-margin tails and local smoothness yields improved out-of-sample calibration across optimizers without sacrificing accuracy.
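For intuition on the curvature side, the Gauss-Newton trace has a closed form for a linear softmax classifier. This toy case is ours, not the paper's general setting, but it already shows the margin-tail behavior: small margins (near-uniform probabilities) keep the trace large, while large margins drive it toward zero.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gn_trace_linear_softmax(W, x):
    """Trace of the Gauss-Newton matrix of softmax cross-entropy
    with respect to W, for a linear model z = W @ x.

    For cross-entropy the Hessian in logit space is diag(p) - p p^T,
    so tr(GN) = ||x||^2 * (1 - sum_k p_k^2). This is only an
    illustrative special case of the curvature the paper bounds.
    """
    p = softmax(W @ x)
    return float(x @ x) * (1.0 - float(p @ p))

W = np.eye(2)
x = np.array([2.0, 0.0])
# Scaling W up raises confidence (larger margin) and lowers curvature.
print(gn_trace_linear_softmax(W, x) > gn_trace_linear_softmax(5 * W, x))  # → True
```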

What carries the argument

The margin-dependent exponential tail functional, which bounds both calibration error and curvature throughout optimization.
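One way to read the shared-control claim, in schematic form (our reconstruction from the abstract; the paper's exact functional and constants may differ):

```latex
% Schematic only: \gamma_i(\theta) is the margin of example i, G the
% Gauss-Newton matrix, c_1 and c_2 the problem-specific constants.
T(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma_i(\theta_t)},
\qquad
\mathrm{ECE}(\theta_t) \le c_1 \, T(\theta_t),
\qquad
\operatorname{tr} G(\theta_t) \le c_2 \, T(\theta_t).
```

Under this reading, both quantities shrink whenever the margin tail thins, which is what the reported tracking along the trajectory would reflect.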

If this is right

  • Calibration can be improved at training time by intervening on margins rather than through post-hoc adjustments.
  • The connection between sharpness and miscalibration is mediated by the statistics of the margin distribution.
  • The same training change improves calibration under multiple gradient-based optimizers.
  • Accuracy and calibration need not trade off when the objective explicitly encourages better margin tails.
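A minimal sketch of what a margin-aware objective of this kind could look like. This is illustrative only: the paper's actual objective, including its robust-margin and local-smoothness terms, is not reproduced here, and `lam` and `tau` are hypothetical knobs.

```python
import numpy as np

def margin_aware_loss(logits, labels, lam=0.1, tau=1.0):
    """Cross-entropy plus an exponential penalty on small margins.

    logits: (n, k) array; labels: (n,) int array.
    """
    n = len(labels)
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_p[np.arange(n), labels].mean()

    # Multiclass margin: true-class logit minus best competing logit.
    true = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf
    margin = true - masked.max(axis=1)

    # Exponential tail penalty, dominated by the smallest margins --
    # the same tail statistic the shared bound is built on.
    tail = np.exp(-margin / tau).mean()
    return ce + lam * tail
```

A local-smoothness term (for example a gradient-norm or SAM-style perturbation penalty) would sit on top of this; it is omitted to keep the sketch minimal.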

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same margin-tail mechanism may link calibration to generalization phenomena that also depend on margins.
  • If the coupling holds, similar objectives could be tested on non-vision tasks to check whether optimizer independence extends further.
  • The result suggests examining whether altering margins independently of curvature can decouple the two quantities.

Load-bearing premise

The coupling between margins, curvature, and calibration is causal, so explicitly targeting margin tails during training will reliably improve calibration beyond the settings tested.

What would settle it

Training with the margin-aware objective on a new optimizer or vision dataset and observing no reduction in ECE relative to standard training, while test accuracy stays comparable.

Figures

Figures reproduced from arXiv: 2604.20614 by Alessandro Morosini, Matea Gjika, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1. Training dynamics for Gradient Descent and Stochastic Gradient Descent across learning rates on CIFAR-10. Expected Calibration Error closely tracks sharpness throughout training: both rise as the model enters the edge-of-stability regime, peak around the same time, and decay together as training progresses.
Figure 2. Training dynamics for SAM and Muon across learning rates on CIFAR-10.
Figure 3. Training dynamics for BulkSGD across different learning rates and numbers of projected-out gradients on CIFAR-10.
read the original abstract

Modern neural networks can achieve high accuracy while remaining poorly calibrated, producing confidence estimates that do not match empirical correctness. Yet calibration is often treated as a post-hoc attribute. We take a different perspective: we study calibration as a training-time phenomenon on small vision tasks, and ask whether calibrated solutions can be obtained reliably by intervening on the training procedure. We identify a tight coupling between calibration, curvature, and margins during training of deep networks under multiple gradient-based methods. Empirically, Expected Calibration Error (ECE) closely tracks curvature-based sharpness throughout optimization. Mathematically, we show that both ECE and Gauss--Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. Guided by this mechanism, we introduce a margin-aware training objective that explicitly targets robust-margin tails and local smoothness, yielding improved out-of-sample calibration across optimizers without sacrificing accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies calibration as a training-time phenomenon in deep networks on small vision tasks. It reports that Expected Calibration Error (ECE) empirically tracks Gauss-Newton curvature across multiple gradient-based optimizers. It derives that both ECE and curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the optimization trajectory. Guided by this, the authors propose a margin-aware training objective targeting robust margins and local smoothness, which empirically improves out-of-sample calibration without accuracy loss.

Significance. If the shared-control derivation is rigorous and the constants remain stable, the work offers a mechanistic link between optimization geometry, margins, and calibration that could inform training procedures for better-calibrated models. The empirical consistency across optimizers and the introduction of a targeted objective are strengths; reproducible code or machine-checked elements would further strengthen it, but none are mentioned.

major comments (2)
  1. [Mathematical derivation (abstract and §3)] The central claim that both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional (up to problem-specific constants) is load-bearing for the subsequent proposal of the margin-aware objective. The derivation must explicitly show that these constants remain bounded and independent of depth, width, iteration count, and curvature spikes; otherwise the coupling does not tightly imply the observed tracking or the causal benefit of the new objective. This needs to be verified in the mathematical section with the precise functional form and any boundedness assumptions stated.
  2. [Empirical results section] The empirical claim that ECE closely tracks curvature throughout optimization relies on the tail term dominating even in early high-curvature regimes where margins may shrink. The manuscript should include an ablation or sensitivity analysis showing that the coupling persists when the exponential tail is not the leading term, or clarify the conditions under which the approximation holds.
minor comments (2)
  1. [Notation and preliminaries] Notation for the margin-dependent exponential tail functional should be defined once with all dependencies (e.g., on the loss, Hessian approximation) made explicit to avoid ambiguity when comparing to standard ECE and curvature definitions.
  2. [Experiments] The manuscript would benefit from a table summarizing the problem-specific constants across the reported tasks and optimizers to illustrate their stability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We respond to each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Mathematical derivation (abstract and §3)] The central claim that both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional (up to problem-specific constants) is load-bearing for the subsequent proposal of the margin-aware objective. The derivation must explicitly show that these constants remain bounded and independent of depth, width, iteration count, and curvature spikes; otherwise the coupling does not tightly imply the observed tracking or the causal benefit of the new objective. This needs to be verified in the mathematical section with the precise functional form and any boundedness assumptions stated.

    Authors: We agree that making the boundedness explicit strengthens the mathematical foundation. In the revised version, we will expand §3 to include the precise functional form of the margin-dependent exponential tail and explicitly state the boundedness assumptions. Under the assumptions of bounded data norms, Lipschitz-continuous activations, and positive margins along the trajectory, the constants are shown to depend only on these problem-specific quantities and remain independent of network depth, width, iteration count, and transient curvature spikes. This is derived by bounding the tail integral using the margin lower bound and gradient norms. We believe this addresses the concern and supports the proposal of the margin-aware objective. revision: yes

  2. Referee: [Empirical results section] The empirical claim that ECE closely tracks curvature throughout optimization relies on the tail term dominating even in early high-curvature regimes where margins may shrink. The manuscript should include an ablation or sensitivity analysis showing that the coupling persists when the exponential tail is not the leading term, or clarify the conditions under which the approximation holds.

    Authors: We thank the referee for pointing this out. While our empirical results show consistent tracking across optimizers, we will add a sensitivity analysis in the empirical section. This will include an ablation where we compute the relative contribution of the exponential tail term versus other factors at different optimization stages, particularly in early iterations. We will also clarify that the approximation holds primarily when margins are sufficiently positive and the tail dominates, which is the regime where calibration improves; in cases where margins shrink significantly, the coupling may be weaker, and we will discuss this limitation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical observation that ECE tracks curvature sharpness, followed by a mathematical derivation showing both quantities are controlled by the same margin-dependent exponential tail functional up to problem-specific constants. This bound is stated as a derived result along the optimization trajectory rather than a definitional equivalence or fitted input renamed as prediction. The margin-aware objective is introduced as guided by the identified mechanism but does not reduce to the bound by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatz smuggling are evident in the abstract or described chain. The derivation remains self-contained with independent content from standard ECE, Gauss-Newton curvature, and margin concepts.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an unproven mathematical equivalence between ECE and curvature through a margin-tail functional, plus empirical observation on small tasks. No new physical entities are introduced; the constants are problem-specific and the new objective adds margin and smoothness terms.

free parameters (1)
  • problem-specific constants
    The mathematical control holds only up to these constants, which are not derived from first principles and must be treated as fitted or task-dependent.
axioms (1)
  • domain assumption Both ECE and Gauss-Newton curvature are controlled by the same margin-dependent exponential tail functional
    This is the load-bearing mathematical claim stated in the abstract.

pith-pipeline@v0.9.0 · 5470 in / 1170 out tokens · 33708 ms · 2026-05-10T00:20:54.181140+00:00 · methodology

discussion (0)

