Calibrated Probability Forecast Sequences and Measure-Valued Martingales

Christopher Ferro; Thomas Wilkinson

arxiv: 2606.31621 · v1 · pith:36Q3FZT7new · submitted 2026-06-30 · 🧮 math.ST · stat.TH

Pith reviewed 2026-07-01 02:18 UTC · model grok-4.3

keywords: probability forecasting · auto-calibration · measure-valued martingales · forecast sequences · calibration testing · Borel spaces · statistical tests

The pith

Auto-calibration of probability forecast sequences holds exactly when the forecasts form a measure-valued martingale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends auto-calibration from single forecasts to sequences of forecasts that are updated for each observation. It shows that, for observations in any Borel space, this extended auto-calibration is equivalent to the sequence of random probability measures satisfying the martingale property. The equivalence supplies a statistical test for the calibration of such sequences. This matters because earlier calibration checks applied only to single forecasts and, according to the abstract, no prior method existed for testing repeatedly updated forecasts.

Core claim

For observations that sit in any Borel space, auto-calibration of a forecast sequence is equivalent to the associated sequence of random probability measures satisfying the martingale property. The paper proposes a simple statistical approach to testing this martingale property, which yields the first method for assessing calibration of sequences of probability forecasts.
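
One way to make the claimed equivalence concrete is sketched below. The notation and the exact form of the sequence definition are assumptions made for illustration, since this page does not reproduce the paper's definitions: the forecasts are written P_1, ..., P_T, the information they carry is the sigma-field G_t they generate, and the sequence is closed with the point mass at the observation. Under those assumptions the two directions follow from the tower property of conditional expectation.

```latex
% Assumed notation (not taken from the paper):
%   Y      observation with values in a Borel space (S, \mathcal{S})
%   P_t    forecast issued at step t = 1, ..., T (a random probability measure on S)
%   \mathcal{G}_t = \sigma(P_1, \dots, P_t), the information in the forecasts so far
%   P_{T+1} := \delta_Y, the point mass at the realised observation
\begin{align*}
\text{(auto-calibration)} \quad
  & \mathbb{P}(Y \in B \mid \mathcal{G}_t) = P_t(B)
    \quad \text{a.s., for all } B \in \mathcal{S},\ t = 1, \dots, T, \\
\text{(martingale)} \quad
  & \mathbb{E}[P_{t+1}(B) \mid \mathcal{G}_t] = P_t(B)
    \quad \text{a.s., for all } B \in \mathcal{S},\ t = 1, \dots, T.
\end{align*}

% Auto-calibration implies the martingale property. For t < T:
\begin{align*}
\mathbb{E}[P_{t+1}(B) \mid \mathcal{G}_t]
  = \mathbb{E}\big[\mathbb{P}(Y \in B \mid \mathcal{G}_{t+1}) \mid \mathcal{G}_t\big]
  = \mathbb{P}(Y \in B \mid \mathcal{G}_t)
  = P_t(B),
\end{align*}
% and for t = T the martingale condition is auto-calibration at T itself:
\begin{align*}
\mathbb{E}[\delta_Y(B) \mid \mathcal{G}_T] = \mathbb{P}(Y \in B \mid \mathcal{G}_T) = P_T(B).
\end{align*}

% Conversely, iterating the martingale property from step T + 1 down to step t:
\begin{align*}
\mathbb{P}(Y \in B \mid \mathcal{G}_t)
  = \mathbb{E}[P_{T+1}(B) \mid \mathcal{G}_t]
  = \mathbb{E}\big[\mathbb{E}[P_{T+1}(B) \mid \mathcal{G}_T] \mid \mathcal{G}_t\big]
  = \mathbb{E}[P_T(B) \mid \mathcal{G}_t]
  = \dots
  = P_t(B).
\end{align*}
```

The closing point mass does real work in this sketch: without it, a forecaster who issues the same wrong distribution at every step would form a (constant) martingale without being calibrated. Whether the paper closes the sequence in the same way is not stated on this page.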

What carries the argument

The equivalence between extended auto-calibration of forecast sequences and the martingale property of the associated measure-valued processes.

If this is right

The martingale property can be tested directly to verify auto-calibration of updating forecast sequences.
The test applies uniformly to any Borel space of observations.
Calibration assessment no longer needs separate handling for single versus sequential forecasts.
Statistical procedures already developed for martingales become available for forecast calibration.
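
To illustrate what testing the martingale property directly could look like, here is a minimal Python sketch for the simplest case of a binary outcome. It is not the paper's test, whose statistic is not described on this page. The moment condition, the instrument `p_t - 0.5`, the simulated random-walk setting, and all function names are assumptions chosen for illustration, and the z-test assumes the forecast cases are independent.

```python
import math
import numpy as np


def normal_cdf(x):
    """Standard normal CDF, applied elementwise via the stdlib erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(x) / math.sqrt(2.0)))


def simulate_sequences(n, T, rng, overconfidence=1.0):
    """Simulate n forecast sequences for the binary event Y = 1{S_T > 0}.

    S_t is a random walk with standard normal steps. With overconfidence=1
    the forecast p_t = Phi(S_t / sqrt(T - t)) equals P(Y = 1 | S_t); larger
    values exaggerate the evidence. Returns shape (n, T + 1): forecasts at
    t = 0..T-1 followed by the outcome Y in the last column.
    """
    steps = rng.standard_normal((n, T))
    walk = np.concatenate([np.zeros((n, 1)), np.cumsum(steps, axis=1)], axis=1)
    y = (walk[:, -1] > 0).astype(float)
    remaining_sd = np.sqrt(T - np.arange(T))
    p = normal_cdf(overconfidence * walk[:, :T] / remaining_sd)
    return np.column_stack([p, y])


def martingale_z_test(seq):
    """z-test of E[(p_{t+1} - p_t) * (p_t - 0.5)] = 0, summed over t.

    Under the martingale property each increment has conditional mean zero,
    so its product with any function of the current forecast has mean zero.
    The instrument p_t - 0.5 is an illustrative choice, not a canonical one.
    """
    increments = np.diff(seq, axis=1)
    instrument = seq[:, :-1] - 0.5
    u = np.sum(increments * instrument, axis=1)  # one value per forecast case
    z = u.mean() / (u.std(ddof=1) / math.sqrt(len(u)))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print("calibrated:   ", martingale_z_test(simulate_sequences(5000, 5, rng)))
    print("overconfident:", martingale_z_test(simulate_sequences(5000, 5, rng, 2.0)))
```

The sketch checks a single moment condition, so it can only detect departures that correlate with the chosen instrument. For a binary outcome the forecast is a single number; for a general Borel space one would need to pick events or test functions, which is exactly the part a full procedure has to specify.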

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same martingale test might be applied to online forecasting systems that issue repeated updates.
Links could be explored between this equivalence and other martingale characterizations in sequential prediction.
Efficient computational versions of the proposed statistical test could be developed for large data sets.
The sequence definition of auto-calibration might be compared with other calibration notions under the same martingale lens.

Load-bearing premise

The chosen extension of single-forecast auto-calibration to sequences of forecasts is the correct definition to test via the martingale property.

What would settle it

A concrete sequence of forecasts in a Borel space that satisfies the martingale property yet fails the extended auto-calibration definition, or the converse.
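
On a finite toy model, a search for such a counterexample can be done exactly. The sketch below is an illustration under stated assumptions, not the paper's construction: the observation is the number of heads in three fair coin flips, each forecast sequence ends with the point mass at the outcome, and auto-calibration is taken to mean that the conditional law of the outcome given the forecast history equals the latest forecast. All names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import comb

T = 3                    # number of fair coin flips
OUTCOMES = range(T + 1)  # Y = total number of heads


def conditional_law(prefix):
    """Exact law of Y given the flips seen so far (the ideal forecaster)."""
    heads, left = sum(prefix), T - len(prefix)
    return tuple(
        Fraction(comb(left, k - heads), 2 ** left) if 0 <= k - heads <= left else Fraction(0)
        for k in OUTCOMES
    )


def tails_forecaster(prefix):
    """Overconfident forecaster: point mass at the heads counted so far."""
    return tuple(Fraction(int(k == sum(prefix))) for k in OUTCOMES)


def forecast_paths(forecaster):
    """One (forecast history, outcome) pair per equally likely flip sequence."""
    return [
        (tuple(forecaster(omega[:t]) for t in range(T + 1)), sum(omega))
        for omega in product((0, 1), repeat=T)
    ]


def mean_law(laws):
    """Average of probability vectors (a conditional expectation, since paths are equally likely)."""
    return tuple(sum(col) / len(laws) for col in zip(*laws))


def is_martingale(paths):
    """Check E[P_{t+1} | P_0..P_t] = P_t exactly for every t < T."""
    for t in range(T):
        groups = defaultdict(list)
        for history, _ in paths:
            groups[history[: t + 1]].append(history[t + 1])
        if any(mean_law(nxt) != past[-1] for past, nxt in groups.items()):
            return False
    return True


def is_auto_calibrated(paths):
    """Check law(Y | P_0..P_t) = P_t exactly for every t."""
    for t in range(T + 1):
        groups = defaultdict(list)
        for history, y in paths:
            groups[history[: t + 1]].append(tuple(Fraction(int(k == y)) for k in OUTCOMES))
        if any(mean_law(ys) != past[-1] for past, ys in groups.items()):
            return False
    return True


if __name__ == "__main__":
    for name, f in [("ideal", conditional_law), ("tails", tails_forecaster)]:
        paths = forecast_paths(f)
        print(name, is_martingale(paths), is_auto_calibrated(paths))
```

In this toy model the two checks agree for both forecasters, which is consistent with the claimed equivalence but proves nothing about general Borel spaces. A forecaster for which the two functions disagree would be the kind of object this section describes, subject to the assumed definitions matching the paper's.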

Original abstract

We consider the calibration of probability forecasts. Several notions of calibration exist when the forecaster issues a single forecast for each of the observations that is to be predicted. We extend one of these notions, auto-calibration, to the common situation in which the forecaster issues a sequence of forecasts for each observation, repeatedly updating their prediction as they receive additional information. For observations that sit in any Borel space, we show that auto-calibration is equivalent to a certain sequence of random probability measures satisfying the martingale property, and we propose a simple, statistical approach to testing this property. This provides, for the first time, a way of testing the calibration of such sequences of probability forecasts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a martingale equivalence for auto-calibration of updating forecast sequences and a statistical test for it.

The main thing to know is that this paper extends auto-calibration from single forecasts to sequences of updating ones and shows the property is equivalent to the sequence of predictive measures forming a martingale, then suggests a test for the martingale condition. It works for observations in any Borel space.

They do a clean job generalizing the single-forecast case. The martingale framing is a natural fit and gives a concrete way to check calibration when forecasts keep getting revised. The claim that this is the first such test for sequences looks right based on the abstract.

The soft spot is the test procedure itself. The abstract calls it simple, but without seeing the exact steps, how they handle the measure-valued setting, or any checks on size and power, it is hard to judge how practical it will be. That part needs to be spelled out clearly in the paper.

The definition of the extended auto-calibration looks like a reasonable choice rather than an arbitrary one, so the equivalence does not seem forced.

This is for people working on probabilistic forecasting who need to verify calibration on sequential predictions. A reader who already knows the single-forecast literature will see the extension quickly.

It deserves peer review. The core claim is precise and the motivation is practical.

Referee Report

2 major / 2 minor

Summary. The paper extends the notion of auto-calibration from single forecasts to sequences of forecasts issued for each observation in an arbitrary Borel space. It establishes an equivalence between this extended auto-calibration and the property that the associated sequence of random probability measures forms a martingale, and proposes a statistical test of the martingale property as a means to assess calibration of such forecast sequences.

Significance. If the equivalence holds, the result supplies the first explicit characterization and testable criterion for calibration of sequential probability forecasts in general spaces. The martingale formulation connects forecasting theory to stochastic processes in a manner that may enable further theoretical development and practical diagnostics in dynamic prediction problems.

major comments (2)

The manuscript asserts an equivalence between the extended auto-calibration and the martingale property, but the precise definition of the extended auto-calibration (and whether it is chosen independently of the martingale characterization or constructed to produce the equivalence) must be stated explicitly in the main text before the theorem; without this, it is unclear whether the result is substantive or definitional.
The proposed statistical test of the martingale property is described only at a high level in the abstract; the test statistic, its asymptotic properties, and any assumptions required for validity (e.g., on the filtration or the Borel space) need to be derived and stated in the section presenting the test so that the procedure can be evaluated for correctness and practicality.

minor comments (2)

Notation for the sequence of forecasts and the associated random measures should be introduced with a single consistent definition early in the paper to avoid ambiguity when moving between the calibration and martingale statements.
The claim that the approach works 'for the first time' for sequences should be supported by a brief comparison to prior work on sequential calibration or martingale characterizations of forecasts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the detailed comments, which help improve the clarity of the presentation. We address each major comment below.

Referee: The manuscript asserts an equivalence between the extended auto-calibration and the martingale property, but the precise definition of the extended auto-calibration (and whether it is chosen independently of the martingale characterization or constructed to produce the equivalence) must be stated explicitly in the main text before the theorem; without this, it is unclear whether the result is substantive or definitional.

Authors: The extended auto-calibration is obtained by a direct, natural extension of the single-forecast definition to sequences of forecasts; it is introduced independently of the martingale property, after which the equivalence is proved as a theorem. We agree that placing the definition immediately before the theorem will remove any ambiguity. We will revise the manuscript to state the definition of extended auto-calibration explicitly in the main text right before the theorem statement.

revision: yes

Referee: The proposed statistical test of the martingale property is described only at a high level in the abstract; the test statistic, its asymptotic properties, and any assumptions required for validity (e.g., on the filtration or the Borel space) need to be derived and stated in the section presenting the test so that the procedure can be evaluated for correctness and practicality.

Authors: The test is introduced in Section 4 of the manuscript. We acknowledge that the current exposition is concise and that explicit statements of the test statistic, its limiting distribution, and the precise measurability and integrability assumptions would aid evaluation. We will expand that section to include the derivation of the statistic, its asymptotic properties under the null, and the required conditions on the filtration and the Borel space.

revision: yes

Circularity Check

0 steps flagged

Mathematical equivalence with no circularity

The paper establishes a mathematical if-and-only-if equivalence between the extended definition of auto-calibration for forecast sequences and the martingale property of associated random probability measures on Borel spaces. This is presented as a theorem derived from the definitions, with an explicit extension of the single-forecast notion stated as the natural choice. No fitted parameters, self-citations, or ansatzes are invoked as load-bearing steps; the result does not reduce to its inputs by construction. The proposed statistical test follows directly from the equivalence without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on standard measure-theoretic assumptions for Borel spaces and the martingale property in the space of probability measures; no free parameters or invented entities are indicated in the abstract.

axioms (2)

domain assumption: Observations lie in a Borel space. Required for the equivalence statement in the abstract.
domain assumption: The martingale property in the space of probability measures captures the intended notion of auto-calibration for sequences. Central to the claimed equivalence.

pith-pipeline@v0.9.1-grok · 5632 in / 1065 out tokens · 33183 ms · 2026-07-01T02:18:49.111199+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 12 canonical work pages

[1]

BROCKWELL, A. E. (2007). Universal residuals: a multivariate transformation. Statist. Probab. Lett. 77 1473–1478. https://doi.org/10.1016/j.spl.2007.02.008 MR2395595

work page doi:10.1016/j.spl.2007.02.008 2007
[2]

CZADO, C., GNEITING, T. and HELD, L. (2009). Predictive model assessment for count data. Biometrics 65 1254–1261. https://doi.org/10.1111/j.1541-0420.2009.01191.x MR2756513

work page doi:10.1111/j.1541-0420.2009.01191.x 2009
[3]

DAWID, A. P. (1984). Statistical theory. The prequential approach. J. Roy. Statist. Soc. Ser. A 147 278–292. https://doi.org/10.2307/2981683 MR763811

work page doi:10.2307/2981683 1984
[4]

DIEBOLD, F. X., GUNTHER, T. A. and TAY, A. S. (1998). Evaluating Density Forecasts with Applications to Financial Risk Management. International Economic Review 39 863–883. https://doi.org/10.2307/2527342

1998
[5]

FERGUSON, T. S. (1967). Mathematical statistics: A decision theoretic approach. Probability and Mathematical Statistics, Vol. 1. Academic Press, New York-London. MR215390

1967
[6]

GNEITING, T., BALABDAOUI, F. and RAFTERY, A. E. (2007). Probabilistic forecasts, calibration and sharpness. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 243–268. https://doi.org/10.1111/j.1467-9868.2007.00587.x MR2325275

work page doi:10.1111/j.1467-9868 2007
[7]

GNEITING, T. and RAFTERY, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102 359–378. https://doi.org/10.1198/016214506000001437 MR2345548

work page doi:10.1198/016214506000001437 2007
[8]

GNEITING, T. and RANJAN, R. (2013). Combining predictive distributions. Electron. J. Stat. 7 1747–1782. https://doi.org/10.1214/13-EJS823 MR3080409

work page doi:10.1214/13-ejs823 2013
[9]

HOROWITZ, J. (1985). Measure-valued random processes. Z. Wahrsch. Verw. Gebiete 70 213–236. https://doi.org/10.1007/BF02451429 MR799147

work page doi:10.1007/bf02451429 1985
[10]

HYTÖNEN, T., VAN NEERVEN, J., VERAAR, M. and WEIS, L. (2016). Analysis in Banach spaces. Vol. I. Martingales and Littlewood-Paley theory. Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics [Results in Mathematics and Related Areas. 3rd Series. A Series of Modern Surveys in Mathematics] 63. Springer, Cham. MR3617205

2016
[11]

KALLENBERG, O. (2017). Random measures, theory and applications. Probability Theory and Stochastic Modelling 77. Springer, Cham. https://doi.org/10.1007/978-3-319-41598-7 MR3642325

work page doi:10.1007/978-3-319-41598-7 2017
[12]

KALLENBERG, O. (2021). Foundations of modern probability, third ed. Probability Theory and Stochastic Modelling 99. Springer, Cham. https://doi.org/10.1007/978-3-030-61871-1 MR4226142

work page doi:10.1007/978-3-030-61871-1 2021
[13]

KNÜPPEL, M., KRÜGER, F. and POHLE, M.-O. (2023). Score-based calibration testing for multivariate forecast distributions. https://doi.org/10.48550/arXiv.2211.16362

work page doi:10.48550/arxiv.2211.16362 2023
[14]

MITCHELL, J. (2008). Density forecast revisions and forecast efficiency. This paper is not currently available online. See https://api.semanticscholar.org/CorpusID:17382920

2008
[15]

MODESTE, T. (2023). Évaluation et construction des prévisions probabilistes : Score et calibration dans un cadre dynamique, Theses, Université Claude Bernard - Lyon I https://theses.hal.science/tel-04517250

2023
[16]

NORDHAUS, W. D. (1987). Forecasting Efficiency: Concepts and Applications. The Review of Economics and Statistics 69 667–674. https://doi.org/10.2307/1935962

work page doi:10.2307/1935962 1987
[18]

TSYPLAKOV, A. (2020). Evaluation of Probabilistic Forecasts: Conditional Auto-calibration. https://dx.doi.org/10.2139/ssrn.2236605

work page doi:10.2139/ssrn.2236605 2020
