Calibrated Probability Forecast Sequences and Measure-Valued Martingales
Pith reviewed 2026-07-01 02:18 UTC · model grok-4.3
The pith
Auto-calibration of probability forecast sequences holds exactly when the forecasts form a measure-valued martingale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For observations that sit in any Borel space, auto-calibration of a forecast sequence is equivalent to the associated sequence of random probability measures satisfying the martingale property. The paper proposes a simple statistical approach to testing this martingale property, which yields the first method for assessing calibration of sequences of probability forecasts.
What carries the argument
The equivalence between extended auto-calibration of forecast sequences and the martingale property of the associated measure-valued processes.
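To make the two sides of the equivalence concrete, the block below writes out the usual single-forecast auto-calibration condition next to the generic martingale condition for a measure-valued process. The notation ($Y$, $\mathcal{F}_t$, $\mu_t$, $T$) is assumed here for illustration, and the paper's own definitions (in particular its choice of filtration) may differ in detail.

```latex
% Assumed notation (not taken from the paper):
%   Y       observation with values in a Borel space (\mathcal{Y}, \mathcal{B})
%   \mu     a single random probability forecast for Y
%   \mu_t   forecast issued at step t, measurable w.r.t. \mathcal{F}_t
%   (\mathcal{F}_t)_{t=0}^{T}   filtration describing the forecaster's information

% Single forecast: auto-calibration
\[
  \mathbb{P}(Y \in B \mid \mu) = \mu(B)
  \quad \text{a.s. for all } B \in \mathcal{B}.
\]

% Forecast sequence: martingale property of (\mu_t)
\[
  \mathbb{E}\bigl[\mu_t(B) \mid \mathcal{F}_s\bigr] = \mu_s(B)
  \quad \text{a.s. for all } B \in \mathcal{B},\ 0 \le s \le t \le T.
\]

% If the sequence is closed by the point mass at the outcome, \mu_T = \delta_Y,
% then taking t = T in the martingale property gives
\[
  \mathbb{P}(Y \in B \mid \mathcal{F}_s) = \mu_s(B)
  \quad \text{a.s. for all } B \in \mathcal{B},\ 0 \le s \le T.
\]
```

Read this way, each forecast in the sequence is the conditional law of the outcome given the information available at that step, which is the sense in which the martingale property encodes calibration.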
If this is right
- The martingale property can be tested directly to verify auto-calibration of updating forecast sequences.
- The test applies uniformly to any Borel space of observations.
- Calibration assessment no longer needs separate handling for single versus sequential forecasts.
- Statistical procedures already developed for martingales become available for forecast calibration.
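As a rough illustration of the first and last points, the sketch below simulates forecast sequences for a binary event and applies a simple moment-based check of the martingale property. It is not the test proposed in the paper, whose statistic is not described on this page. The data-generating model, the function names `simulate_paths` and `martingale_z`, and the overconfidence factor `k` are all assumptions made for the example. For a binary event each forecast measure reduces to a single probability, so the measure-valued martingale property becomes a scalar one; a general Borel space would need many sets or test functions.

```python
import math
import numpy as np


def normal_cdf(x):
    """Standard normal CDF, applied elementwise to an array."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def simulate_paths(n, T, k, rng):
    """Simulate n forecast sequences for the event Y = 1{X_1 + ... + X_T > 0}.

    Hypothetical model: X_j are i.i.d. N(0, 1) signals revealed one per step.
    The forecast at step t (t = 0..T-1) is Phi(k * S_t / sqrt(T - t)), where
    S_t is the partial sum. k = 1 is the true conditional probability;
    k > 1 gives an overconfident forecaster.
    Returns an (n, T + 1) array: T forecasts followed by the outcome Y.
    """
    x = rng.standard_normal((n, T))
    # Partial sums S_0, ..., S_T with S_0 = 0.
    s = np.concatenate([np.zeros((n, 1)), np.cumsum(x, axis=1)], axis=1)
    y = (s[:, -1] > 0).astype(float)
    remaining_sd = np.sqrt(T - np.arange(T))  # sqrt(T - t) for t = 0..T-1
    forecasts = normal_cdf(k * s[:, :T] / remaining_sd)
    return np.column_stack([forecasts, y])


def martingale_z(paths):
    """Z-statistic for E[(p_{t+1} - p_t) * (p_t - 0.5)] = 0, summed over steps.

    Under the martingale property each increment has conditional mean zero,
    so the per-sequence sums are i.i.d. with mean zero and the statistic is
    approximately standard normal for a large number of sequences.
    """
    increments = np.diff(paths, axis=1)
    weights = paths[:, :-1] - 0.5  # depends only on the earlier forecast
    m = np.sum(increments * weights, axis=1)
    return m.mean() / (m.std(ddof=1) / math.sqrt(len(m)))


rng = np.random.default_rng(0)
z_calibrated = martingale_z(simulate_paths(20000, 5, 1.0, rng))
z_overconfident = martingale_z(simulate_paths(20000, 5, 2.0, rng))
print(f"calibrated z = {z_calibrated:.2f}, overconfident z = {z_overconfident:.2f}")
```

The weight `p_t - 0.5` is only one choice of predictable test function; it targets over- or underconfidence, and other departures from the martingale property would call for other weights.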
Where Pith is reading between the lines
- The same martingale test might be applied to online forecasting systems that issue repeated updates.
- Links could be explored between this equivalence and other martingale characterizations in sequential prediction.
- Efficient computational versions of the proposed statistical test could be developed for large data sets.
- The sequence definition of auto-calibration might be compared with other calibration notions under the same martingale lens.
Load-bearing premise
The chosen extension of single-forecast auto-calibration to sequences of forecasts is the correct definition to test via the martingale property.
What would settle it
A concrete sequence of forecasts in a Borel space that satisfies the martingale property yet fails the extended auto-calibration definition, or the converse.
Original abstract
We consider the calibration of probability forecasts. Several notions of calibration exist when the forecaster issues a single forecast for each of the observations that is to be predicted. We extend one of these notions, auto-calibration, to the common situation in which the forecaster issues a sequence of forecasts for each observation, repeatedly updating their prediction as they receive additional information. For observations that sit in any Borel space, we show that auto-calibration is equivalent to a certain sequence of random probability measures satisfying the martingale property, and we propose a simple, statistical approach to testing this property. This provides, for the first time, a way of testing the calibration of such sequences of probability forecasts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the notion of auto-calibration from single forecasts to sequences of forecasts issued for each observation in an arbitrary Borel space. It establishes an equivalence between this extended auto-calibration and the property that the associated sequence of random probability measures forms a martingale, and proposes a statistical test of the martingale property as a means to assess calibration of such forecast sequences.
Significance. If the equivalence holds, the result supplies the first explicit characterization and testable criterion for calibration of sequential probability forecasts in general spaces. The martingale formulation connects forecasting theory to stochastic processes in a manner that may enable further theoretical development and practical diagnostics in dynamic prediction problems.
major comments (2)
- The manuscript asserts an equivalence between the extended auto-calibration and the martingale property, but the precise definition of the extended auto-calibration (and whether it is chosen independently of the martingale characterization or constructed to produce the equivalence) must be stated explicitly in the main text before the theorem; without this, it is unclear whether the result is substantive or definitional.
- The proposed statistical test of the martingale property is described only at a high level in the abstract; the test statistic, its asymptotic properties, and any assumptions required for validity (e.g., on the filtration or the Borel space) need to be derived and stated in the section presenting the test so that the procedure can be evaluated for correctness and practicality.
minor comments (2)
- Notation for the sequence of forecasts and the associated random measures should be introduced with a single consistent definition early in the paper to avoid ambiguity when moving between the calibration and martingale statements.
- The claim that the approach works 'for the first time' for sequences should be supported by a brief comparison to prior work on sequential calibration or martingale characterizations of forecasts.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the detailed comments, which help improve the clarity of the presentation. We address each major comment below.
- Referee: The manuscript asserts an equivalence between the extended auto-calibration and the martingale property, but the precise definition of the extended auto-calibration (and whether it is chosen independently of the martingale characterization or constructed to produce the equivalence) must be stated explicitly in the main text before the theorem; without this, it is unclear whether the result is substantive or definitional.
  Authors: The extended auto-calibration is obtained by a direct, natural extension of the single-forecast definition to sequences of forecasts; it is introduced independently of the martingale property, after which the equivalence is proved as a theorem. We agree that placing the definition immediately before the theorem will remove any ambiguity. We will revise the manuscript to state the definition of extended auto-calibration explicitly in the main text right before the theorem statement.
  Revision: yes
- Referee: The proposed statistical test of the martingale property is described only at a high level in the abstract; the test statistic, its asymptotic properties, and any assumptions required for validity (e.g., on the filtration or the Borel space) need to be derived and stated in the section presenting the test so that the procedure can be evaluated for correctness and practicality.
  Authors: The test is introduced in Section 4 of the manuscript. We acknowledge that the current exposition is concise and that explicit statements of the test statistic, its limiting distribution, and the precise measurability and integrability assumptions would aid evaluation. We will expand that section to include the derivation of the statistic, its asymptotic properties under the null, and the required conditions on the filtration and the Borel space.
  Revision: yes
Circularity Check
Mathematical equivalence with no circularity
The paper establishes a mathematical if-and-only-if equivalence between the extended definition of auto-calibration for forecast sequences and the martingale property of associated random probability measures on Borel spaces. This is presented as a theorem derived from the definitions, with an explicit extension of the single-forecast notion stated as the natural choice. No fitted parameters, self-citations, or ansatzes are invoked as load-bearing steps; the result does not reduce to its inputs by construction. The proposed statistical test follows directly from the equivalence without circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Observations lie in a Borel space.
- Domain assumption: The martingale property in the space of probability measures captures the intended notion of auto-calibration for sequences.