Measures of predictive accuracy, miscalibration and discrimination
Pith reviewed 2026-05-14 20:05 UTC · model grok-4.3
The pith
ABC, ABC² and Gini scores rely on predictor-dependent weights that break alignment with mean-consistent loss functions and can produce dishonest model evaluations for point predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that ABC, ABC² and Gini scores rely on predictor-dependent weights and therefore fail to align with the class of mean-consistent scoring functions given by Bregman divergences. As a direct consequence, they may rank point predictors dishonestly when used for model selection, whereas the miscalibration and discrimination components extracted from the authors' version of Murphy's decomposition remain aligned with that class.
What carries the argument
Murphy's decomposition of expected loss into miscalibration and discrimination terms, derived without explicit dependence on the response variable, set against Lorenz-curve accuracy measures whose weights depend on the predictor.
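For orientation, the kind of decomposition referred to here can be sketched as follows. This is the standard Bregman-divergence form written from general knowledge, not necessarily the authors' exact statement; the symbols $\phi$, $\pi$, $\pi_{\mathrm{rc}}$ and $\mu$ are notation assumed for this sketch.

```latex
% Assumed notation (not taken from the paper):
%   Y response, \pi = \pi(X) predictor, \mu = E[Y],
%   \pi_{rc} = E[Y | \pi] the recalibrated predictor,
%   L_\phi the Bregman divergence generated by a strictly convex \phi.
\begin{align*}
\mathbb{E}\,L_\phi(Y,\pi)
  = \underbrace{\mathbb{E}\,L_\phi(Y,\mu)}_{\text{uncertainty}}
  - \underbrace{\mathbb{E}\,L_\phi(\pi_{\mathrm{rc}},\mu)}_{\text{discrimination}}
  + \underbrace{\mathbb{E}\,L_\phi(\pi_{\mathrm{rc}},\pi)}_{\text{miscalibration}} .
\end{align*}
```

In this form the discrimination and miscalibration terms are evaluated at $\pi_{\mathrm{rc}}$, $\pi$ and $\mu$ only; the response $Y$ enters them solely through its conditional mean.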
If this is right
- ABC² reduces some of the original ABC's difficulties in detecting mean calibration.
- The Gini score inherits the same predictor-dependent weighting problem as ABC.
- When Lorenz curves cross once, third-degree stochastic dominance supplies weaker but still usable dominance criteria for certain subclasses of Bregman divergences.
- Model selection should prefer the miscalibration and discrimination measures from the Murphy decomposition over ABC, ABC² or Gini.
Where Pith is reading between the lines
- Practitioners could run side-by-side comparisons on real data to count how often ABC and a chosen Bregman loss disagree on which predictor is better.
- The dominance results when curves intersect may carry over to other curve-based ranking problems in insurance or risk scoring.
- The new Murphy decomposition without direct response dependence could simplify software implementations for large-scale forecast evaluation.
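The side-by-side comparison suggested in the list above could be sketched as follows. This is a minimal illustration under stated assumptions: ABC is taken as the signed area between the empirical concentration and Lorenz curves, a smaller absolute ABC is treated as "better", squared error stands in for the chosen Bregman loss, and the function names and toy data are hypothetical. The paper's own definitions (ties, case weights, sign conventions) may differ.

```python
import numpy as np

def abc_signed(y, pred):
    """Signed area between the empirical concentration and Lorenz curves.

    Assumed definition for this sketch: cases are sorted by ascending
    prediction, both curves are evaluated on the grid i/n, and the area is
    computed with the trapezoid rule. Ties and case weights are ignored.
    """
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    order = np.argsort(pred, kind="stable")
    cc = np.concatenate([[0.0], np.cumsum(y[order]) / y.sum()])        # concentration curve
    lc = np.concatenate([[0.0], np.cumsum(pred[order]) / pred.sum()])  # Lorenz curve
    diff = cc - lc
    return float(np.sum((diff[:-1] + diff[1:]) / 2.0) / len(y))

def squared_error(y, pred):
    """Mean Bregman loss generated by phi(y) = y**2."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean((y - pred) ** 2))

def rankings_disagree(y, pred_a, pred_b):
    """True if |ABC| and mean squared error prefer different predictors."""
    abc_gap = abs(abc_signed(y, pred_a)) - abs(abc_signed(y, pred_b))
    mse_gap = squared_error(y, pred_a) - squared_error(y, pred_b)
    return bool(np.sign(abc_gap) != np.sign(mse_gap))

# Hypothetical toy data: pred_a ranks perfectly but is off by a factor of 10,
# pred_b is close in level but swaps the first two cases.
y = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = 10.0 * y
pred_b = np.array([2.0, 1.0, 3.0, 4.0])

print("ABC:", abc_signed(y, pred_a), abc_signed(y, pred_b))
print("MSE:", squared_error(y, pred_a), squared_error(y, pred_b))
print("disagree:", rankings_disagree(y, pred_a, pred_b))
```

On real data, `rankings_disagree` could be applied to every pair of candidate predictors and the results tallied. Under the definition assumed here, both curves are normalised by their own totals, so rescaling a predictor by a positive constant leaves ABC unchanged while the squared error changes.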
Load-bearing premise
Mean-consistent loss functions given by Bregman divergences form the right reference class for honest evaluation of point predictions, and any predictor-dependent weighting necessarily falls outside that class.
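As a reminder of what this reference class looks like, the standard textbook definitions can be written as below. The notation ($\phi$, $m$, $L_\phi$) is assumed for this sketch rather than taken from the paper.

```latex
% Bregman divergence generated by a strictly convex, differentiable \phi:
\[
L_\phi(y, m) \;=\; \phi(y) - \phi(m) - \phi'(m)\,(y - m) \;\ge\; 0 .
\]
% Mean-consistency: the true mean minimises the expected loss.
\[
\mathbb{E}\,L_\phi\bigl(Y, \mathbb{E}[Y]\bigr) \;\le\; \mathbb{E}\,L_\phi(Y, m)
\qquad \text{for all admissible } m .
\]
% Two familiar members of the class:
%   \phi(y) = y^2            gives  L_\phi(y, m) = (y - m)^2            (squared error)
%   \phi(y) = y \log y - y   gives  L_\phi(y, m) = y \log(y/m) - (y - m) (half the Poisson deviance)
```

Both examples weight the discrepancy between $y$ and $m$ through $\phi$ alone, which is fixed in advance and does not depend on the predictor being evaluated.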
What would settle it
A numerical example or dataset in which the predictor preferred by ABC (or by the Gini score) has strictly higher expected Bregman loss than its competitor would show that the alignment failure produces contradictory model rankings.
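A toy instance of such a contradiction is easy to construct for a rank-based Gini-type score, as the sketch below suggests. It assumes a simple normalisation (one minus twice the area under the empirical concentration curve) and uses half the Poisson deviance as the Bregman loss; the data and function names are hypothetical, and the paper's Gini definition may be normalised differently.

```python
import numpy as np

def gini_score(y, pred):
    """Gini-type score: 1 - 2 * area under the empirical concentration curve.

    Assumed normalisation for this sketch. Cases are sorted by ascending
    prediction, so the score depends on the predictor only through its ranks.
    """
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    order = np.argsort(pred, kind="stable")
    cc = np.concatenate([[0.0], np.cumsum(y[order]) / y.sum()])
    area = np.sum((cc[:-1] + cc[1:]) / 2.0) / len(y)
    return float(1.0 - 2.0 * area)

def poisson_deviance_half(y, pred):
    """Mean Bregman loss for phi(y) = y*log(y) - y; requires y > 0 and pred > 0."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean(y * np.log(y / pred) - (y - pred)))

# Hypothetical toy data.
y = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = 10.0 * y                        # perfect ranking, level off by a factor of 10
pred_b = np.array([2.0, 1.0, 3.0, 4.0])  # nearly right level, first two cases swapped

gini_a, gini_b = gini_score(y, pred_a), gini_score(y, pred_b)
dev_a, dev_b = poisson_deviance_half(y, pred_a), poisson_deviance_half(y, pred_b)

print("Gini:    ", gini_a, gini_b)   # higher is treated as better
print("Deviance:", dev_a, dev_b)     # lower is better
```

Here the Gini-type score prefers `pred_a` while the Bregman loss prefers `pred_b`. The example only shows that the two criteria can disagree; how often this happens in practice is an empirical question.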
Original abstract
We study the evaluation of real-valued point predictors under the decision-theoretic framework of mean-consistent loss functions given by the Bregman divergences. We first derive a new version of Murphy's decomposition of the expected loss which does not directly include the response itself but only its predictors. We then relate the miscalibration and the discrimination component of the Murphy's decomposition to Lorenz-curve-based accuracy measures that are widely used in practice. Besides the usual area between the concentration and Lorenz curves, ABC, we introduce a mean-squared version ABC$^2$ that mitigates some of the weaknesses of the original ABC in identifying mean-calibration. More importantly, both ABC and ABC$^2$ are shown to rely on predictor-dependent weights, so they fail to align with the class of mean-consistent scoring functions. In the same spirit, we derive a similar result for the widely used Gini score. These results indicate that ABC, ABC$^2$ and Gini scores may lead to dishonest evaluation of point predictions when used for model selection; this gives support to use mean-consistent loss functions as well as the miscalibration and the discrimination measure from the Murphy's decomposition of the expected loss for model evaluation. Finally, we study forecast dominance when Lorenz curves intersect. We show that Lorenz and Murphy's curves have the same number of crossings and, in the one-crossing case, we establish weaker dominance criteria for subclasses of Bregman divergences through third-degree stochastic dominance.
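For readers unfamiliar with the curves named in the abstract, the usual population definitions are sketched below. They follow the common actuarial convention as best recalled; the notation is assumed for this sketch, and the paper may use a different sign or an absolute value in ABC.

```latex
% Assumed notation: \pi = \pi(X) is the predictor, F_\pi^{-1} its quantile
% function, and \alpha \in [0, 1] a probability level.
\begin{align*}
\mathrm{LC}[\pi;\alpha]
  &= \frac{\mathbb{E}\bigl[\pi\,\mathbf{1}\{\pi \le F_\pi^{-1}(\alpha)\}\bigr]}{\mathbb{E}[\pi]}
  && \text{(Lorenz curve)} \\
\mathrm{CC}[Y,\pi;\alpha]
  &= \frac{\mathbb{E}\bigl[Y\,\mathbf{1}\{\pi \le F_\pi^{-1}(\alpha)\}\bigr]}{\mathbb{E}[Y]}
  && \text{(concentration curve)} \\
\mathrm{ABC}
  &= \int_0^1 \bigl(\mathrm{CC}[Y,\pi;\alpha] - \mathrm{LC}[\pi;\alpha]\bigr)\,d\alpha .
\end{align*}
```

If the predictor is auto-calibrated, that is $\mathbb{E}[Y \mid \pi] = \pi$, the tower property gives $\mathrm{CC} = \mathrm{LC}$ at every $\alpha$ and hence $\mathrm{ABC} = 0$. A zero signed integral does not by itself force the two curves to coincide.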
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to derive a new version of Murphy's decomposition of expected loss for real-valued point predictors that depends only on the predictors (not the responses) under the framework of mean-consistent Bregman divergences. It relates the resulting miscalibration and discrimination components to Lorenz-curve accuracy measures, introduces the mean-squared variant ABC² to better identify mean-calibration, demonstrates that ABC, ABC² and the Gini score employ predictor-dependent weights that place them outside the mean-consistent class, and concludes that these measures can produce dishonest model selection. The paper further shows that Lorenz and Murphy curves share the same number of crossings and derives weaker dominance results via third-degree stochastic dominance for one-crossing cases within subclasses of Bregman divergences.
Significance. If the algebraic derivations hold, the work supplies a decision-theoretic argument against relying on ABC, ABC² and Gini for selecting among point predictors and instead favors mean-consistent losses together with the explicit miscalibration and discrimination terms from the Murphy decomposition. The dominance results when curves intersect add a practical tool for comparing forecasts. These contributions are relevant to statistical model evaluation and could affect how accuracy is assessed in applications that use Lorenz-based or Gini-type scores.
major comments (2)
- The central claim that ABC, ABC² and Gini rely on predictor-dependent weights misaligned with mean-consistent Bregman divergences is load-bearing; the manuscript must exhibit the explicit functional form of these weights (or the resulting scoring rule) so that readers can verify the dependence on the predictor and the consequent failure of alignment.
- The new predictor-only Murphy decomposition is the foundation for all subsequent relations to Lorenz measures; the derivation steps that eliminate the response variable while preserving the decomposition into miscalibration and discrimination must be shown in full, including any intermediate expectations or conditioning arguments.
minor comments (3)
- In the abstract and introduction, briefly recall the definition of mean-consistency for Bregman divergences to make the reference class explicit for readers unfamiliar with the term.
- In the dominance section, state the precise subclass of Bregman divergences for which the third-degree stochastic dominance implication holds in the one-crossing case.
- Ensure consistent notation between the abstract (ABC²) and the main text when defining the mean-squared variant.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive suggestions. We address each major comment below and will incorporate the requested clarifications in a revised manuscript.
- Referee: The central claim that ABC, ABC² and Gini rely on predictor-dependent weights misaligned with mean-consistent Bregman divergences is load-bearing; the manuscript must exhibit the explicit functional form of these weights (or the resulting scoring rule) so that readers can verify the dependence on the predictor and the consequent failure of alignment.
  Authors: We agree that explicit forms are necessary for verification. In the revision we will add the closed-form expressions for the predictor-dependent weights underlying ABC, ABC², and the Gini score, together with the equivalent scoring rules, and show directly that these weights depend on the predictor value and therefore lie outside the mean-consistent Bregman class.
  Revision: yes
- Referee: The new predictor-only Murphy decomposition is the foundation for all subsequent relations to Lorenz measures; the derivation steps that eliminate the response variable while preserving the decomposition into miscalibration and discrimination must be shown in full, including any intermediate expectations or conditioning arguments.
  Authors: We accept the request for complete transparency. The revised manuscript will present the full derivation of the predictor-only Murphy decomposition, including all intermediate conditional expectations, the law of total expectation steps that remove the response variable, and the preservation of the miscalibration and discrimination terms.
  Revision: yes
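The conditioning argument mentioned in the second response can be checked numerically. The sketch below assumes the standard Bregman form of the decomposition (total = uncertainty - discrimination + miscalibration, with the last two terms evaluated at the recalibrated predictor rather than at the response) and verifies it in-sample on simulated data. The function names, the simulated data and the choice of two losses are illustrative assumptions, not the authors' derivation.

```python
import numpy as np

def bregman(phi, dphi, y, m):
    """Bregman divergence L_phi(y, m) = phi(y) - phi(m) - phi'(m) * (y - m)."""
    return phi(y) - phi(m) - dphi(m) * (y - m)

def murphy_terms(phi, dphi, y, pred):
    """Return in-sample (total, uncertainty, discrimination, miscalibration)."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mu = np.full_like(y, y.mean())
    # Recalibrated predictor: empirical mean of y within each distinct prediction.
    _, inverse = np.unique(pred, return_inverse=True)
    pred_rc = (np.bincount(inverse, weights=y) / np.bincount(inverse))[inverse]
    total = bregman(phi, dphi, y, pred).mean()
    unc = bregman(phi, dphi, y, mu).mean()
    dsc = bregman(phi, dphi, pred_rc, mu).mean()    # y enters only via pred_rc and mu
    mcb = bregman(phi, dphi, pred_rc, pred).mean()  # y enters only via pred_rc
    return total, unc, dsc, mcb

# Simulated data (hypothetical): the true conditional mean is 1.2 * pred,
# so pred is deliberately miscalibrated. The gamma draw keeps y positive.
rng = np.random.default_rng(0)
pred = rng.choice([1.0, 2.0, 4.0], size=2000)
y = rng.gamma(shape=2.0, scale=0.6 * pred)

losses = {
    "squared error": (lambda v: v ** 2, lambda v: 2.0 * v),
    "half Poisson deviance": (lambda v: v * np.log(v) - v, np.log),
}
results = {}
for name, (phi, dphi) in losses.items():
    results[name] = murphy_terms(phi, dphi, y, pred)
    total, unc, dsc, mcb = results[name]
    print(name, "total:", total, "unc - dsc + mcb:", unc - dsc + mcb)
```

The identity holds exactly in-sample here because the empirical group means are the conditional expectations under the empirical distribution. With a continuous predictor, the recalibration step would need a smoother or binning, which is outside this sketch.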
Circularity Check
No significant circularity
The paper derives a Murphy decomposition expressed only in terms of predictors and relates ABC/ABC²/Gini scores to predictor-dependent weights outside the mean-consistent Bregman class. These steps begin from the standard definition of Bregman divergences and the known Murphy decomposition; no equation reduces a target quantity to a parameter fitted inside the paper, no self-citation is invoked as a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work by the same authors. The interpretive preference for mean-consistency is stated explicitly rather than derived from the paper's own fitted objects. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Bregman divergences form the class of mean-consistent loss functions for real-valued point predictors
- standard math: Murphy's decomposition of expected loss exists and can be rewritten to depend only on predictors