Robust Bayesian Predictive Model Selection using Bregman Divergence
Pith reviewed 2026-06-27 12:41 UTC · model grok-4.3
The pith
Replacing the log score with a Bregman scoring rule in leave-one-out cross-validation yields a predictive model selector that, under misspecification, asymptotically picks the model whose predictive distribution is closest to the data-generating process in the corresponding divergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A score-matched generalized ELPD framework replaces the log score by a Bregman scoring rule both to form the generalized posterior and to evaluate leave-one-out predictive utility; under model misspecification this procedure asymptotically selects the model whose predictive distribution is closest to the data-generating process under the chosen Bregman divergence.
What carries the argument
The Bregman scoring rule and its associated generalized posterior, which together define the generalized ELPD used for predictive utility ranking.
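The abstract gives no explicit formulas, so the following is only a sketch of how these pieces are usually assembled in the generalized-Bayes literature. The symbols $S$ (a loss-oriented scoring rule), $w$ (a learning rate), $f_\theta$ (the model density), $\pi_S$, and $\mathrm{GELPD}_{\mathrm{LOO}}$ are notation assumed here, not taken from the paper.

```latex
% Generalized (Gibbs) posterior built from a scoring rule S,
% with learning rate w > 0 (assumed notation):
\pi_S(\theta \mid y_{1:n}) \;\propto\; \pi(\theta)\,
  \exp\Big\{ -w \sum_{i=1}^{n} S\big(y_i, f_\theta\big) \Big\}

% Leave-one-out predictive density under that posterior:
p_S(\tilde y \mid y_{-i}) \;=\; \int f_\theta(\tilde y)\, \pi_S(\theta \mid y_{-i})\, d\theta

% Score-matched LOO utility (higher is better), evaluated with the same S:
\mathrm{GELPD}_{\mathrm{LOO}} \;=\; -\sum_{i=1}^{n} S\big(y_i,\; p_S(\cdot \mid y_{-i})\big)
```

With $S(y, f) = -\log f(y)$ and $w = 1$, the first line reduces to the ordinary Bayes posterior and the last line to the standard LOO ELPD, which is the sense in which the framework generalizes it.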
If this is right
- Model rankings become tunable for outlier sensitivity by choice of the beta parameter in the beta-divergence family.
- In microbial and forensic data examples the selected model can differ from the one chosen by ordinary ELPD because low-density observations exert less influence.
- The framework supplies a direct proper-score generalization of standard leave-one-out cross-validation.
- Asymptotic consistency targets the predictive distribution that minimizes the chosen divergence rather than the Kullback-Leibler divergence.
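The reduced influence of low-density observations can be illustrated with a small sketch. It assumes the density power form of the β-score (as in Basu et al., 1998) and a Gaussian predictive distribution, for which the integral term has a closed form; the function names are hypothetical and nothing here is taken from the paper's own code.

```python
import math


def log_normal_pdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_loss(y, mu, sigma):
    """Negative log score: grows without bound as y moves into the tail."""
    return -log_normal_pdf(y, mu, sigma)


def beta_loss(y, mu, sigma, beta):
    """Assumed density power (beta) score for a Gaussian predictive, beta > 0."""
    # Closed form of the integral of N(z; mu, sigma^2)^(1 + beta) over z.
    integral = (2.0 * math.pi * sigma ** 2) ** (-beta / 2.0) / math.sqrt(1.0 + beta)
    # f(y)^beta computed on the log scale to avoid underflow problems.
    density_term = math.exp(beta * log_normal_pdf(y, mu, sigma)) / beta
    return -density_term + integral / (1.0 + beta)


if __name__ == "__main__":
    # Compare the penalty each score assigns as y moves away from the bulk.
    for y in (0.0, 3.0, 10.0, 30.0):
        print(y, log_loss(y, 0.0, 1.0), beta_loss(y, 0.0, 1.0, beta=0.5))
```

Under this form the β-loss is bounded above by the integral term, so a single extreme observation contributes at most a fixed penalty, whereas the log loss grows quadratically in the distance from the mean. As β approaches 0, the β-loss plus the constant 1/β - 1 approaches the log loss.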
Where Pith is reading between the lines
- The same generalized-posterior construction could be applied with other proper scoring rules to achieve robustness properties not limited to the Bregman family.
- In settings with heavy tails or contamination the method offers a concrete way to trade bias for reduced variance in model selection.
- The divergence-minimizing property suggests that model averaging weights derived from the generalized ELPD would also converge to weights concentrated on the closest predictive distributions.
Load-bearing premise
The Bregman scoring rule and generalized posterior are assumed to produce an out-of-sample utility ranking that is asymptotically consistent for the divergence-minimizing model; no explicit conditions on the model class or data-generating process are stated.
What would settle it
A Monte Carlo experiment in which the procedure repeatedly selects a model whose predictive distribution does not minimize the target Bregman divergence to the known data-generating process would falsify the asymptotic selection claim.
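A heavily simplified version of such an experiment might look like the sketch below. It is not the paper's procedure: the two candidates are fixed Gaussian predictive densities rather than generalized posterior predictives, there is no leave-one-out step, and the contaminated-normal data-generating process, the candidate names, and the density power form of the β-score are all assumptions made for illustration.

```python
import math
import random


def log_normal_pdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_loss(y, mu, sigma):
    return -log_normal_pdf(y, mu, sigma)


def beta_loss(y, mu, sigma, beta=0.5):
    """Assumed density power (beta) score for a Gaussian predictive."""
    integral = (2.0 * math.pi * sigma ** 2) ** (-beta / 2.0) / math.sqrt(1.0 + beta)
    return -math.exp(beta * log_normal_pdf(y, mu, sigma)) / beta + integral / (1.0 + beta)


def draw_contaminated(rng, n, eps=0.1, shift=10.0):
    """Hypothetical DGP: (1 - eps) * N(0, 1) + eps * N(shift, 1)."""
    return [rng.gauss(shift, 1.0) if rng.random() < eps else rng.gauss(0.0, 1.0)
            for _ in range(n)]


# Two fixed candidate predictive densities, given as (mean, sd).
# "core" matches the uncontaminated component; "moment_matched" matches the
# mean (1) and variance (10) of the full mixture with eps = 0.1, shift = 10.
CANDIDATES = {"core": (0.0, 1.0), "moment_matched": (1.0, math.sqrt(10.0))}


def select(ys, loss):
    """Return the candidate name with the smallest average loss on ys."""
    def mean_loss(name):
        mu, sigma = CANDIDATES[name]
        return sum(loss(y, mu, sigma) for y in ys) / len(ys)
    return min(CANDIDATES, key=mean_loss)


def run(n_reps=100, n=1000, seed=0):
    """Count how often each score selects each candidate across replications."""
    rng = random.Random(seed)
    counts = {score: {name: 0 for name in CANDIDATES} for score in ("log", "beta")}
    for _ in range(n_reps):
        ys = draw_contaminated(rng, n)
        counts["log"][select(ys, log_loss)] += 1
        counts["beta"][select(ys, beta_loss)] += 1
    return counts


if __name__ == "__main__":
    print(run())
```

In this toy setting the expected scores can be computed in closed form: the moment-matched candidate has the smaller expected log loss, while the core candidate has the smaller expected β-loss at β = 0.5. Selection counts that persistently contradicted those targets as the sample size grows would be the kind of evidence the paragraph above describes, although a real test would need the full generalized posterior and LOO machinery.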
Original abstract
Predictive Bayesian model comparison often relies on leave-one-out (LOO) cross-validation criteria such as the expected log predictive density (ELPD). However, model rankings can be overly sensitive to outliers and tail mismatch because ELPD is based on the log score. We propose a score-matched generalized ELPD framework that replaces the log score by a Bregman scoring rule to update model parameters through a generalized posterior and to evaluate LOO predictive utility. Candidate posterior predictive distributions are ranked by out-of-sample utility under the chosen scoring rule, yielding a direct proper-score generalization of standard ELPD. We focus especially on the $\beta$-divergence family, where $\beta$ controls the sensitivity of predictive comparison to low-density observations. Under model misspecification, the procedure asymptotically selects the model whose predictive distribution is closest to the data-generating process under the chosen Bregman divergence. A simulation study and applications to microbial and forensic data show that the generalized ELPD can change the selected model through reduced sensitivity to low-density observations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a generalized ELPD framework that replaces the log score with a Bregman scoring rule (focusing on the β-divergence family) both to form a generalized posterior and to compute LOO predictive utility for model ranking. Under misspecification the procedure is claimed to asymptotically select the predictive distribution minimizing the chosen divergence to the DGP. Simulations and applications to microbial and forensic data are reported to produce different model rankings than standard ELPD due to reduced sensitivity to low-density observations.
Significance. If the asymptotic selection property can be rigorously established, the framework would supply a tunable robust alternative to ELPD-based predictive model comparison. The empirical illustrations already show that altering the scoring rule can change selected models, which is of practical interest in misspecified settings. However, the absence of any derivation, regularity conditions, or quantitative verification of the generalized posterior concentration undermines the central claim and therefore the current significance of the contribution.
major comments (2)
- [Abstract] The asymptotic selection claim (“the procedure asymptotically selects the model whose predictive distribution is closest to the data-generating process under the chosen Bregman divergence”) is stated without any derivation, reference to a theorem, or list of regularity conditions (compactness of parameter space, uniform integrability of the score, uniqueness of the minimizer, ergodicity of the data process). This is the load-bearing theoretical result; its absence prevents assessment of whether the generalized posterior and LOO utility ranking are consistent for the divergence minimizer.
- [Abstract / Method description] The construction of the generalized posterior via replacement of the log score by the Bregman scoring rule is described only at a high level; no explicit form of the generalized posterior, no proof that it concentrates at the expected-score minimizer, and no discussion of how the β parameter enters the posterior are supplied. These steps are required for the subsequent LOO ranking argument.
minor comments (1)
- [Abstract] The phrase “score-matched generalized ELPD framework” is introduced without a precise definition or equation linking the Bregman score to the leave-one-out utility; a short clarifying equation would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thorough review and for highlighting the need for explicit theoretical support. We agree that the current manuscript presents the asymptotic selection property and the generalized posterior construction at a high level. Below we address each major comment and commit to adding the required derivations and explicit forms in a revised version.
- Referee: [Abstract] The asymptotic selection claim (“the procedure asymptotically selects the model whose predictive distribution is closest to the data-generating process under the chosen Bregman divergence”) is stated without any derivation, reference to a theorem, or list of regularity conditions (compactness of parameter space, uniform integrability of the score, uniqueness of the minimizer, ergodicity of the data process). This is the load-bearing theoretical result; its absence prevents assessment of whether the generalized posterior and LOO utility ranking are consistent for the divergence minimizer.
  Authors: We acknowledge that the abstract asserts the asymptotic selection property without a derivation or list of regularity conditions in the main text. Although the claim is a direct consequence of standard consistency results for generalized posteriors defined by proper scoring rules, we agree that a self-contained argument is required. In the revision we will add a dedicated theoretical section that derives the asymptotic selection result under explicit regularity conditions (compact parameter space, uniform integrability of the Bregman score, uniqueness of the minimizer, and ergodicity of the data-generating process).
  Revision: yes
- Referee: [Abstract / Method description] The construction of the generalized posterior via replacement of the log score by the Bregman scoring rule is described only at a high level; no explicit form of the generalized posterior, no proof that it concentrates at the expected-score minimizer, and no discussion of how the β parameter enters the posterior are supplied. These steps are required for the subsequent LOO ranking argument.
  Authors: We accept the criticism that the generalized posterior is introduced only conceptually. The revised manuscript will supply the explicit functional form of the generalized posterior, prove its concentration at the minimizer of the expected Bregman score (under the regularity conditions listed in the response to the first comment), and detail how the tuning parameter β enters both the posterior and the LOO utility through the β-divergence scoring rule.
  Revision: yes
Circularity Check
No circularity: asymptotic claim rests on external properties of proper scoring rules
The paper's central claim—that the procedure asymptotically selects the Bregman-divergence-minimizing predictive distribution under misspecification—is presented as a direct consequence of the general theory of proper scoring rules and generalized posteriors. No equation or derivation step within the abstract or described framework reduces this result to a fitted parameter, self-defined quantity, or load-bearing self-citation internal to the paper. The consistency argument is invoked from established scoring-rule properties rather than constructed tautologically inside the manuscript, rendering the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- β
axioms (1)
- [standard math] Bregman divergences define proper scoring rules whose expected value is minimized by the true predictive distribution.
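The axiom can be made concrete for the separable case. The derivation below is a standard textbook-style sketch rather than the paper's own statement; the generator $\psi$, the densities $p$ (true) and $q$ (forecast), and the symbols $S_\psi$, $D_\psi$, $S_\beta$ are notation assumed here.

```latex
% Separable Bregman score generated by a strictly convex, differentiable \psi:
S_\psi(y, q) \;=\; -\psi'\big(q(y)\big)
  + \int \big[\psi'\big(q(z)\big)\,q(z) - \psi\big(q(z)\big)\big]\,dz

% The expected score gap under the true density p is a Bregman divergence:
\mathbb{E}_{p}\,S_\psi(Y, q) - \mathbb{E}_{p}\,S_\psi(Y, p)
  \;=\; \int \big[\psi\big(p(z)\big) - \psi\big(q(z)\big)
        - \psi'\big(q(z)\big)\,\big(p(z) - q(z)\big)\big]\,dz
  \;=\; D_\psi(p, q) \;\ge\; 0

% \beta-family: \psi(t) = t^{1+\beta} / \{\beta(1+\beta)\}, \beta > 0, gives
S_\beta(y, q) \;=\; -\frac{1}{\beta}\,q(y)^{\beta}
  + \frac{1}{1+\beta}\int q(z)^{1+\beta}\,dz
```

Strict convexity of $\psi$ makes the gap zero only when $q = p$ almost everywhere, which is what "proper" means here; the regularity conditions needed to carry this population-level statement over to a finite-sample selection rule are exactly what the referee report asks the authors to supply.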