On Calibration of Modern Neural Networks
read the original abstract
Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
This paper has not been read by Pith yet.
Forward citations
Cited by 26 Pith papers
-
Bayesian Social Deduction with Graph-Informed Language Models
Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
-
Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems
Pointwise metrics compress marginal spectra in multimodal inverse problems, and a three-part protocol using CRPS, spectrum fidelity, and calibration reverses model rankings on synthetic and particle-physics benchmarks.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, wit...
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
-
Photometric Redshift PDFs via Neural Network Classification for DESI Legacy Imaging Surveys and Pan-STARRS
Neural network classification with CRPS optimization produces calibrated photometric redshift PDFs for DESI Legacy and Pan-STARRS data, achieving σ_NMAD of 0.0153 on LSDR10 and outperforming regression methods.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
When to Answer and When to Defer: A Decision Framework for Reliable Code Predictions
Introduces a unified framework integrating uncertainty estimation, calibration, and tool-based abstention for reliable code predictions in language models.
-
R2V Agent: Teaching SLMs When to Ask for Help
R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on H...
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction
Larger LLMs handle detailed crystal descriptions better than small ones, and mean negative log-likelihood of predicted numbers tracks prediction error after fine-tuning.
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
-
Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
Sycophantic GRPO fine-tuning degrades LLM calibration, raising ECE by 0.006 and MCE by 0.010, with a persistent residual after post-hoc scaling.
-
MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification
MedFormer-UR integrates evidential uncertainty from Dirichlet distributions and class-specific prototypes into a transformer to improve calibration and selective prediction on medical images across four modalities.
-
Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review
PRISMA 2020 systematic review of 78 studies on fetal ultrasound plane classification paired with explainability or uncertainty, introducing the CALIB-XFUS reporting framework across six domains.
-
LiLAW: Lightweight Learnable Adaptive Weighting to Learn Sample Difficulty & Improve Noisy Training
LiLAW learns to weight samples as easy, moderate or hard using three global scalars updated by one gradient step on a validation batch to improve noisy training performance.
-
Calibrated Model-Based Deep Reinforcement Learning
Augmenting model-based RL agents with calibrated predictive uncertainties improves planning, sample efficiency, and exploration on continuous control tasks.
-
A Gaia-linked High-purity QSO Candidate Catalog in Selected Fields with Extinction-binned Calibration and Spectrum-informed Training
The P3 selector achieves 0.9809 purity and 0.8869 completeness for QSO candidates in selected fields, outperforming Gaia's official probabilities.
-
Uncertainty in Physics and AI: Taxonomy, Quantification, and Validation
A unified taxonomy of uncertainty in ML for physics is introduced together with validation tools such as coverage, calibration, and proper scoring rules, illustrated on regression and classification tasks.
-
TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
TRACE is a metrologically-grounded four-layer engineering framework for trustworthy agentic AI that enforces an ML-LLM split, stateful policies, human supervision, and a parsimony metric across critical domains.
-
Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
DAVinCI combines claim attribution to model internals and external sources with entailment-based verification to improve LLM factual reliability by 5-20% on fact-checking datasets.
-
Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems
Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.
-
Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study
CKD risk prediction models achieve AUROC 1.00 internally but drop to 0.48-0.58 externally with high calibration error and low deployment scores, indicating need for external validation.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.