arxiv: 2605.12679 · v1 · submitted 2026-05-12 · 📊 stat.ME


Measures of predictive accuracy, miscalibration and discrimination

Łukasz Delong, Mario Wüthrich


Pith reviewed 2026-05-14 20:05 UTC · model grok-4.3


keywords predictive accuracy, miscalibration, discrimination, Murphy decomposition, Lorenz curve, Bregman divergence, point prediction, model evaluation


The pith

ABC, ABC² and Gini scores rely on predictor-dependent weights that break alignment with mean-consistent loss functions and can produce dishonest model evaluations for point predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates real-valued point predictors inside the decision-theoretic setting of mean-consistent losses defined by Bregman divergences. It first produces a version of Murphy's decomposition of expected loss that works only with the predictors rather than the raw responses. It then shows that the widely used Lorenz-curve accuracy measures ABC, its squared variant ABC², and the Gini score all introduce weights that depend on the predictor values themselves. Because these weights vary, the measures do not stay inside the class of mean-consistent scoring functions. The authors therefore conclude that ABC, ABC² and Gini can select models in ways that contradict proper loss-based evaluation, and they recommend instead using the miscalibration and discrimination terms from the Murphy decomposition. The paper closes by examining dominance relations when Lorenz curves cross.
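
For readers who want the reference class spelled out, the standard textbook definitions can be written as below. This is a sketch of the usual formulation, not a quotation from the paper; the symbols $\phi$, $Y$ and $m$ are notation assumed here.

```latex
% Bregman divergence generated by a strictly convex, differentiable \phi
L_\phi(y, m) = \phi(y) - \phi(m) - \phi'(m)\,(y - m) \;\ge\; 0 .

% Mean-consistency: the true mean minimizes the expected loss
\mathbb{E}\big[L_\phi(Y, \mathbb{E}[Y])\big] \;\le\; \mathbb{E}\big[L_\phi(Y, m)\big]
\quad \text{for all admissible } m .

% Examples: \phi(y) = y^2 gives the squared error (y - m)^2;
% \phi(y) = y\log y - y gives y\log(y/m) - y + m, i.e. half the Poisson deviance.
```

The point relevant to the paper's argument is that the generator $\phi$ is fixed in advance and does not depend on the predictor being evaluated.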

Core claim

The central claim is that ABC, ABC² and Gini scores depend on predictor-dependent weights and therefore fail to align with the class of mean-consistent scoring functions given by Bregman divergences; as a direct consequence they may produce dishonest rankings of point predictors when used for model selection, whereas the miscalibration and discrimination components extracted from the authors' version of Murphy's decomposition remain aligned with that class.

What carries the argument

Murphy's decomposition of expected loss into miscalibration and discrimination terms, derived without explicit dependence on the response variable, set against Lorenz-curve accuracy measures whose weights depend on the predictor.
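
A small numerical check may make the decomposition concrete. The sketch below uses the textbook form, expected loss = UNC − DSC + MCB, with MCB = E[L(E[Y|X], X)] and DSC = E[L(E[Y|X], E[Y])] for a Bregman loss L; these two terms involve only the predictor, its recalibrated version and the overall mean. Whether this matches the authors' exact statement is an assumption, and the function names and the toy distribution are hypothetical.

```python
import numpy as np

def bregman(phi, dphi, y, m):
    """Bregman divergence L_phi(y, m) = phi(y) - phi(m) - phi'(m) (y - m)."""
    return phi(y) - phi(m) - dphi(m) * (y - m)

def murphy_terms(p, x, y, phi, dphi):
    """Return (expected loss, UNC, DSC, MCB) for a finite joint law.

    p[i] is the probability of atom i, x[i] the predictor value and
    y[i] the response value on that atom (toy setup, assumed here).
    """
    p, x, y = map(np.asarray, (p, x, y))
    mean_y = np.sum(p * y)
    # Recalibrated predictor E[Y | X]: average y over atoms sharing the same x.
    mu = np.empty_like(y, dtype=float)
    for v in np.unique(x):
        sel = x == v
        mu[sel] = np.sum(p[sel] * y[sel]) / np.sum(p[sel])
    loss = np.sum(p * bregman(phi, dphi, y, x))
    unc = np.sum(p * bregman(phi, dphi, y, mean_y))
    dsc = np.sum(p * bregman(phi, dphi, mu, mean_y))  # no response needed
    mcb = np.sum(p * bregman(phi, dphi, mu, x))       # no response needed
    return loss, unc, dsc, mcb

# Toy example: two predictor levels, each with two equally likely responses.
p = [0.25, 0.25, 0.25, 0.25]
x = [1.0, 1.0, 3.0, 3.0]
y = [0.5, 2.5, 2.0, 3.0]

squared = (lambda t: t**2, lambda t: 2 * t)
poisson = (lambda t: t * np.log(t) - t, lambda t: np.log(t))

for phi, dphi in (squared, poisson):
    loss, unc, dsc, mcb = murphy_terms(p, x, y, phi, dphi)
    # The decomposition holds exactly on this finite distribution.
    assert np.isclose(loss, unc - dsc + mcb)
```

On real data E[Y|X] is unknown and has to be estimated (for example by isotonic regression or binning), so the identity then holds only up to that estimation step.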

If this is right

ABC² reduces some of the original ABC's difficulties in detecting mean calibration.
The Gini score inherits the same predictor-dependent weighting problem as ABC.
When Lorenz curves cross once, third-degree stochastic dominance supplies weaker but still usable dominance criteria for certain subclasses of Bregman divergences.
Model selection should prefer the miscalibration and discrimination measures from the Murphy decomposition over ABC, ABC² or Gini.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Practitioners could run side-by-side comparisons on real data to count how often ABC and a chosen Bregman loss disagree on which predictor is better.
The dominance results when curves intersect may carry over to other curve-based ranking problems in insurance or risk scoring.
The new Murphy decomposition without direct response dependence could simplify software implementations for large-scale forecast evaluation.
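
The side-by-side comparison suggested in the first item could be started from a sketch like the following. It assumes one common empirical convention: the Lorenz curve accumulates the sorted predictor values, the concentration curve accumulates the responses in the same order, both are normalised by their own totals, and ABC is the signed area between them. The paper's exact definitions may differ, and the function names `abc` and `compare` are hypothetical.

```python
import numpy as np

def lorenz_and_concentration(x, y):
    """Empirical Lorenz curve of x and concentration curve of y, ordered by x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Ties in x are kept in input order (stable sort); an assumption of this sketch.
    order = np.argsort(x, kind="stable")
    alpha = np.arange(len(x) + 1) / len(x)
    lc = np.concatenate(([0.0], np.cumsum(x[order]))) / x.sum()
    cc = np.concatenate(([0.0], np.cumsum(y[order]))) / y.sum()
    return alpha, lc, cc

def abc(x, y):
    """Signed area between concentration and Lorenz curves (trapezoid rule)."""
    alpha, lc, cc = lorenz_and_concentration(x, y)
    gap = cc - lc
    return float(np.sum((gap[1:] + gap[:-1]) / 2 * np.diff(alpha)))

def compare(x1, x2, y):
    """ABC and mean squared error (a Bregman loss) for two predictors."""
    y = np.asarray(y, float)
    out = {}
    for name, x in (("x1", x1), ("x2", x2)):
        x = np.asarray(x, float)
        out[name] = {"abc": abc(x, y), "mse": float(np.mean((y - x) ** 2))}
    return out

# Hypothetical data: x1 reproduces y exactly, x2 doubles every prediction.
y = np.array([1.0, 2.0, 3.0, 6.0])
scores = compare(y, 2 * y, y)
```

Under the normalisation assumed here both predictors get an ABC of zero, because rescaling a predictor by a positive constant changes neither curve, while the squared error separates them clearly. Counting such disagreements over many predictor pairs would be the empirical exercise described above.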

Load-bearing premise

Mean-consistent loss functions given by Bregman divergences form the right reference class for honest evaluation of point predictions, and any predictor-dependent weighting necessarily falls outside that class.

What would settle it

A numerical example or dataset in which the predictor preferred by ABC (or by the Gini score) has strictly higher expected loss under a chosen Bregman divergence than its competitor would show that the alignment failure produces contradictory model rankings.

Figures

Figures reproduced from arXiv: 2605.12679 by Łukasz Delong, Mario Wüthrich.

**Figure 6.1.** The Lorenz curves and the concentration curves of the predictors. [PITH_FULL_IMAGE:figures/full_fig_p015_6_1.png]

**Figure 7.1.** Lorenz curves of the predictors $X_1$ and $X_2$ from Example 5. [PITH_FULL_IMAGE:figures/full_fig_p023_7_1.png]

**Figure 8.1.** Murphy's curves measuring the discriminatory power of the predictors. [PITH_FULL_IMAGE:figures/full_fig_p027_8_1.png]

**Figure 8.2.** LHS: The distributions $F_{X_1}$ and $F_{X_2}$ of the predictors $X_1$ and $X_2$ from Example 7. RHS: The double integral $u \mapsto \int_u^\infty \int_v^\infty \big(F_{X_2}(t) - F_{X_1}(t)\big)\,dt\,dv$ of the difference in the distributions of the predictors $X_1$ and $X_2$. [PITH_FULL_IMAGE:figures/full_fig_p032_8_2.png]

**Figure 8.3.** The ratio of the discrimination statistics of the predictors. [PITH_FULL_IMAGE:figures/full_fig_p032_8_3.png]
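
The double integral in the Figure 8.2 caption can be evaluated from samples without numerical quadrature. For a random variable with finite second moment, $\int_u^\infty \int_v^\infty (1 - F(t))\,dt\,dv = \tfrac12\,\mathbb{E}[(X-u)_+^2]$, so the difference reduces to $\tfrac12\big(\mathbb{E}[(X_1-u)_+^2] - \mathbb{E}[(X_2-u)_+^2]\big)$. The sketch below applies this to empirical distributions; the function name is hypothetical and the sample values are illustrative only, not the paper's Example 7.

```python
import numpy as np

def double_tail_integral(x1, x2, u):
    """Evaluate u -> int_u^inf int_v^inf (F_{X2}(t) - F_{X1}(t)) dt dv
    for the empirical distributions of the samples x1 and x2."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    u = np.atleast_1d(np.asarray(u, float))
    # E[(X - u)_+^2] for every threshold u, via broadcasting (samples x thresholds).
    m1 = np.mean(np.clip(x1[:, None] - u[None, :], 0.0, None) ** 2, axis=0)
    m2 = np.mean(np.clip(x2[:, None] - u[None, :], 0.0, None) ** 2, axis=0)
    return 0.5 * (m1 - m2)

# Illustrative samples: x1 is a mean-preserving spread of x2.
values = double_tail_integral([0.0, 2.0], [1.0, 1.0], [0.0, 1.0, 2.0])
```

A curve that keeps one sign over all thresholds is the kind of pattern that third-degree dominance arguments look for; the exact criterion used in the paper should be taken from the paper itself.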

Original abstract

We study the evaluation of real-valued point predictors under the decision-theoretic framework of mean-consistent loss functions given by the Bregman divergences. We first derive a new version of Murphy's decomposition of the expected loss which does not directly include the response itself but only its predictors. We then relate the miscalibration and the discrimination component of the Murphy's decomposition to Lorenz-curve-based accuracy measures that are widely used in practice. Besides the usual area between the concentration and Lorenz curves, ABC, we introduce a mean-squared version ABC$^2$ that mitigates some of the weaknesses of the original ABC in identifying mean-calibration. More importantly, both ABC and ABC$^2$ are shown to rely on predictor-dependent weights, so they fail to align with the class of mean-consistent scoring functions. In the same spirit, we derive a similar result for the widely used Gini score. These results indicate that ABC, ABC$^2$ and Gini scores may lead to dishonest evaluation of point predictions when used for model selection; this gives support to use mean-consistent loss functions as well as the miscalibration and the discrimination measure from the Murphy's decomposition of the expected loss for model evaluation. Finally, we study forecast dominance when Lorenz curves intersect. We show that Lorenz and Murphy's curves have the same number of crossings and, in the one-crossing case, we establish weaker dominance criteria for subclasses of Bregman divergences through third-degree stochastic dominance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ABC, ABC² and Gini scores can rank point predictors unreliably because their weights depend on the predictions, unlike mean-consistent Bregman losses.


The main thing to know is that ABC, ABC², and the Gini score can give misleading rankings when used for model selection. Their implied weights turn out to depend on the predictor values themselves, which places them outside the class of mean-consistent scoring functions built from Bregman divergences. The authors reach this by deriving a Murphy decomposition that works with predictors only, without the responses appearing directly in the formula, then linking the miscalibration and discrimination pieces to the usual Lorenz-curve areas. They also introduce the squared version ABC² to sharpen detection of mean calibration problems. The algebra on the predictor-dependent weights looks clean and appears new relative to the scoring-rule literature they cite. The crossing-curves section adds a modest extension via third-degree stochastic dominance for one-crossing cases, though it stays secondary to the main misalignment result. One soft spot is the strong wording around “dishonest” evaluation; it follows directly from their stated preference for mean-consistency but will not convince readers who treat other loss families as equally legitimate. The dominance claims are stated precisely but would benefit from a concrete numerical example showing when the weaker criterion actually alters a ranking. No circularity or hidden fitting steps show up in the structure. This paper is for people who already use Lorenz-based metrics in forecasting or risk work and want a decision-theoretic reason to switch to the decomposition components instead. It deserves a serious referee because the core derivation is self-contained, reproducible in principle, and directly challenges a common practice.

Referee Report

2 major / 3 minor

Summary. The paper claims to derive a new version of Murphy's decomposition of expected loss for real-valued point predictors that depends only on the predictors (not the responses) under the framework of mean-consistent Bregman divergences. It relates the resulting miscalibration and discrimination components to Lorenz-curve accuracy measures, introduces the mean-squared variant ABC² to better identify mean-calibration, demonstrates that ABC, ABC² and the Gini score employ predictor-dependent weights that place them outside the mean-consistent class, and concludes that these measures can produce dishonest model selection. The paper further shows that Lorenz and Murphy curves share the same number of crossings and derives weaker dominance results via third-degree stochastic dominance for one-crossing cases within subclasses of Bregman divergences.

Significance. If the algebraic derivations hold, the work supplies a decision-theoretic argument against relying on ABC, ABC² and Gini for selecting among point predictors and instead favors mean-consistent losses together with the explicit miscalibration and discrimination terms from the Murphy decomposition. The dominance results when curves intersect add a practical tool for comparing forecasts. These contributions are relevant to statistical model evaluation and could affect how accuracy is assessed in applications that use Lorenz-based or Gini-type scores.

major comments (2)

The central claim that ABC, ABC² and Gini rely on predictor-dependent weights misaligned with mean-consistent Bregman divergences is load-bearing; the manuscript must exhibit the explicit functional form of these weights (or the resulting scoring rule) so that readers can verify the dependence on the predictor and the consequent failure of alignment.
The new predictor-only Murphy decomposition is the foundation for all subsequent relations to Lorenz measures; the derivation steps that eliminate the response variable while preserving the decomposition into miscalibration and discrimination must be shown in full, including any intermediate expectations or conditioning arguments.

minor comments (3)

In the abstract and introduction, briefly recall the definition of mean-consistency for Bregman divergences to make the reference class explicit for readers unfamiliar with the term.
In the dominance section, state the precise subclass of Bregman divergences for which the third-degree stochastic dominance implication holds in the one-crossing case.
Ensure consistent notation between the abstract (ABC²) and the main text when defining the mean-squared variant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive suggestions. We address each major comment below and will incorporate the requested clarifications in a revised manuscript.


Referee: The central claim that ABC, ABC² and Gini rely on predictor-dependent weights misaligned with mean-consistent Bregman divergences is load-bearing; the manuscript must exhibit the explicit functional form of these weights (or the resulting scoring rule) so that readers can verify the dependence on the predictor and the consequent failure of alignment.

Authors: We agree that explicit forms are necessary for verification. In the revision we will add the closed-form expressions for the predictor-dependent weights underlying ABC, ABC², and the Gini score, together with the equivalent scoring rules, and show directly that these weights depend on the predictor value and therefore lie outside the mean-consistent Bregman class.
revision: yes

Referee: The new predictor-only Murphy decomposition is the foundation for all subsequent relations to Lorenz measures; the derivation steps that eliminate the response variable while preserving the decomposition into miscalibration and discrimination must be shown in full, including any intermediate expectations or conditioning arguments.

Authors: We accept the request for complete transparency. The revised manuscript will present the full derivation of the predictor-only Murphy decomposition, including all intermediate conditional expectations, the law of total expectation steps that remove the response variable, and the preservation of the miscalibration and discrimination terms.
revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper derives a Murphy decomposition expressed only in terms of predictors and relates ABC/ABC²/Gini scores to predictor-dependent weights outside the mean-consistent Bregman class. These steps begin from the standard definition of Bregman divergences and the known Murphy decomposition; no equation reduces a target quantity to a parameter fitted inside the paper, no self-citation is invoked as a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work by the same authors. The interpretive preference for mean-consistency is stated explicitly rather than derived from the paper's own fitted objects. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the standard definition and properties of Bregman divergences as mean-consistent losses and on the existence of Murphy's decomposition; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Bregman divergences form the class of mean-consistent loss functions for real-valued point predictors.
Invoked as the decision-theoretic framework throughout the abstract.
standard math: Murphy's decomposition of expected loss exists and can be rewritten to depend only on predictors.
Stated as the first derived result.


