
arxiv: 2606.31621 · v1 · pith:36Q3FZT7new · submitted 2026-06-30 · 🧮 math.ST · stat.TH

Calibrated Probability Forecast Sequences and Measure-Valued Martingales

Pith reviewed 2026-07-01 02:18 UTC · model grok-4.3

classification 🧮 math.ST stat.TH
keywords probability forecasting · auto-calibration · measure-valued martingales · forecast sequences · calibration testing · Borel spaces · statistical tests

The pith

Auto-calibration of probability forecast sequences holds exactly when the forecasts form a measure-valued martingale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends auto-calibration from single forecasts to sequences of forecasts that are repeatedly updated for each observation. It shows that, for observations in any Borel space, this extended auto-calibration is equivalent to the sequence of random probability measures satisfying the martingale property. The equivalence supplies a statistical test for the calibration of such sequences. This matters because earlier calibration checks applied only to single forecasts, and the abstract reports that no method previously existed for sequences of repeatedly updated forecasts.

Core claim

For observations that sit in any Borel space, auto-calibration of a forecast sequence is equivalent to the associated sequence of random probability measures satisfying the martingale property. The paper proposes a simple statistical approach to testing this martingale property, which yields the first method for assessing calibration of sequences of probability forecasts.
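In symbols, the two properties named in the claim are commonly written as below. The notation is assumed here for illustration and is not taken from the paper: a forecast sequence $P_0, \dots, P_T$, a filtration $(\mathcal{F}_t)$, and an observation $Y$ in a Borel space with sigma-field $\mathcal{B}$. The paper's own definition of extended auto-calibration may differ in detail.

```latex
% Single-forecast auto-calibration (standard form): the forecast P equals
% the conditional law of the observation Y given the forecast itself.
\mathbb{P}(Y \in B \mid P) = P(B)
  \quad \text{a.s., for every } B \in \mathcal{B}.

% Martingale property of a measure-valued process (P_t)_{t=0}^{T} adapted
% to (\mathcal{F}_t): each evaluation t \mapsto P_t(B) is a real-valued
% martingale.
\mathbb{E}\bigl[ P_{t+1}(B) \mid \mathcal{F}_t \bigr] = P_t(B)
  \quad \text{a.s., for every } B \in \mathcal{B},\ t = 0, \dots, T-1.
```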

What carries the argument

The equivalence between extended auto-calibration of forecast sequences and the martingale property of the associated measure-valued processes.
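One direction of such an equivalence can be sketched in a few lines. The setup is an assumption made for this sketch and not necessarily the paper's construction: forecasts $P_0, \dots, P_T$ adapted to a filtration $(\mathcal{F}_t)$, with the final element taken to be the point mass $\delta_Y$ at the observation $Y$.

```latex
% Assumption: P_T = \delta_Y, so P_T(B) = \mathbf{1}\{Y \in B\}.
% The martingale property and the tower property give, for every Borel set B,
P_t(B) = \mathbb{E}\bigl[ P_T(B) \mid \mathcal{F}_t \bigr]
       = \mathbb{P}(Y \in B \mid \mathcal{F}_t) \quad \text{a.s.}

% Conditioning further on \sigma(P_t) \subseteq \mathcal{F}_t:
\mathbb{P}(Y \in B \mid P_t)
  = \mathbb{E}\bigl[ \mathbb{P}(Y \in B \mid \mathcal{F}_t) \mid P_t \bigr]
  = \mathbb{E}\bigl[ P_t(B) \mid P_t \bigr]
  = P_t(B) \quad \text{a.s.}
```

Under these assumptions each $P_t$ is auto-calibrated in the single-forecast sense. The converse direction and the measure-theoretic details for general Borel spaces are not covered by this sketch.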

If this is right

  • The martingale property can be tested directly to verify auto-calibration of updating forecast sequences.
  • The test applies uniformly to any Borel space of observations.
  • Calibration assessment no longer needs separate handling for single versus sequential forecasts.
  • Statistical procedures already developed for martingales become available for forecast calibration.
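To make the first and last points concrete, here is a minimal simulation sketch. It is not the test proposed in the paper, which this page does not describe in detail. It assumes a reduction to a single event $B$: if the measure-valued sequence is a martingale, the scalar increments $P_{t+1}(B) - P_t(B)$ have conditional mean zero, so their product with any function of $P_t(B)$ should average to zero across observations. The function names, the Gaussian data-generating process, and the `bias` parameter are all hypothetical.

```python
import math
import numpy as np


def normal_cdf(x):
    """Standard normal CDF, applied elementwise to an array."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def simulate_paths(n_obs, n_steps, rng, bias=0.0):
    """Simulate forecast paths for the event B = {Y > 0}.

    Y is the sum of n_steps independent N(0, 1) shocks. Column t (t < n_steps)
    holds the forecast probability of B after t shocks are known; the last
    column holds the realised indicator 1{Y in B}. bias=0 gives the true
    conditional probabilities; a nonzero bias shifts every forecast.
    """
    shocks = rng.standard_normal((n_obs, n_steps))
    partial = np.concatenate(
        [np.zeros((n_obs, 1)), np.cumsum(shocks, axis=1)], axis=1
    )
    remaining_var = n_steps - np.arange(n_steps)  # n_steps, ..., 1
    probs = normal_cdf((partial[:, :-1] + bias) / np.sqrt(remaining_var))
    outcome = (partial[:, -1] > 0).astype(float)
    return np.column_stack([probs, outcome])


def increment_t_stats(paths, instrument=None):
    """t-statistics for E[(p_{t+1} - p_t) * h(p_t)] = 0, one per step.

    With instrument=None, h is constant 1 and this checks mean-zero
    increments. The instrument h must not make the product constant.
    """
    inc = np.diff(paths, axis=1)
    if instrument is not None:
        inc = inc * instrument(paths[:, :-1])
    n = inc.shape[0]
    return inc.mean(axis=0) / (inc.std(axis=0, ddof=1) / np.sqrt(n))


rng = np.random.default_rng(0)
calibrated = simulate_paths(2000, 5, rng)
biased = simulate_paths(2000, 5, rng, bias=1.0)
print("calibrated:", np.round(increment_t_stats(calibrated), 2))
print("biased:    ", np.round(increment_t_stats(biased), 2))
```

In this toy setup the biased forecasts keep a constant mean until the outcome is revealed, so the violation shows up in the final increment. Mean-zero increments are only a necessary condition for the martingale property; passing an `instrument` such as `lambda p: p` checks one further orthogonality condition against past forecasts.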

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same martingale test might be applied to online forecasting systems that issue repeated updates.
  • Links could be explored between this equivalence and other martingale characterizations in sequential prediction.
  • Efficient computational versions of the proposed statistical test could be developed for large data sets.
  • The sequence definition of auto-calibration might be compared with other calibration notions under the same martingale lens.

Load-bearing premise

The chosen extension of single-forecast auto-calibration to sequences of forecasts is the correct definition to test via the martingale property.

What would settle it

A concrete sequence of forecasts in a Borel space that satisfies the martingale property yet fails the extended auto-calibration definition, or the converse.

Original abstract

We consider the calibration of probability forecasts. Several notions of calibration exist when the forecaster issues a single forecast for each of the observations that is to be predicted. We extend one of these notions, auto-calibration, to the common situation in which the forecaster issues a sequence of forecasts for each observation, repeatedly updating their prediction as they receive additional information. For observations that sit in any Borel space, we show that auto-calibration is equivalent to a certain sequence of random probability measures satisfying the martingale property, and we propose a simple, statistical approach to testing this property. This provides, for the first time, a way of testing the calibration of such sequences of probability forecasts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends the notion of auto-calibration from single forecasts to sequences of forecasts issued for each observation in an arbitrary Borel space. It establishes an equivalence between this extended auto-calibration and the property that the associated sequence of random probability measures forms a martingale, and proposes a statistical test of the martingale property as a means to assess calibration of such forecast sequences.

Significance. If the equivalence holds, the result supplies the first explicit characterization and testable criterion for calibration of sequential probability forecasts in general spaces. The martingale formulation connects forecasting theory to stochastic processes in a manner that may enable further theoretical development and practical diagnostics in dynamic prediction problems.

major comments (2)
  1. The manuscript asserts an equivalence between the extended auto-calibration and the martingale property, but the precise definition of the extended auto-calibration (and whether it is chosen independently of the martingale characterization or constructed to produce the equivalence) must be stated explicitly in the main text before the theorem; without this, it is unclear whether the result is substantive or definitional.
  2. The proposed statistical test of the martingale property is described only at a high level in the abstract; the test statistic, its asymptotic properties, and any assumptions required for validity (e.g., on the filtration or the Borel space) need to be derived and stated in the section presenting the test so that the procedure can be evaluated for correctness and practicality.
minor comments (2)
  1. Notation for the sequence of forecasts and the associated random measures should be introduced with a single consistent definition early in the paper to avoid ambiguity when moving between the calibration and martingale statements.
  2. The claim that the approach works 'for the first time' for sequences should be supported by a brief comparison to prior work on sequential calibration or martingale characterizations of forecasts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the detailed comments, which help improve the clarity of the presentation. We address each major comment below.

Point-by-point responses
  1. Referee: The manuscript asserts an equivalence between the extended auto-calibration and the martingale property, but the precise definition of the extended auto-calibration (and whether it is chosen independently of the martingale characterization or constructed to produce the equivalence) must be stated explicitly in the main text before the theorem; without this, it is unclear whether the result is substantive or definitional.

    Authors: The extended auto-calibration is obtained by a direct, natural extension of the single-forecast definition to sequences of forecasts; it is introduced independently of the martingale property, after which the equivalence is proved as a theorem. We agree that placing the definition immediately before the theorem will remove any ambiguity. We will revise the manuscript to state the definition of extended auto-calibration explicitly in the main text right before the theorem statement. revision: yes

  2. Referee: The proposed statistical test of the martingale property is described only at a high level in the abstract; the test statistic, its asymptotic properties, and any assumptions required for validity (e.g., on the filtration or the Borel space) need to be derived and stated in the section presenting the test so that the procedure can be evaluated for correctness and practicality.

    Authors: The test is introduced in Section 4 of the manuscript. We acknowledge that the current exposition is concise and that explicit statements of the test statistic, its limiting distribution, and the precise measurability and integrability assumptions would aid evaluation. We will expand that section to include the derivation of the statistic, its asymptotic properties under the null, and the required conditions on the filtration and the Borel space. revision: yes

Circularity Check

0 steps flagged

Mathematical equivalence with no circularity

Full rationale

The paper establishes a mathematical if-and-only-if equivalence between the extended definition of auto-calibration for forecast sequences and the martingale property of associated random probability measures on Borel spaces. This is presented as a theorem derived from the definitions, with an explicit extension of the single-forecast notion stated as the natural choice. No fitted parameters, self-citations, or ansatzes are invoked as load-bearing steps; the result does not reduce to its inputs by construction. The proposed statistical test follows directly from the equivalence without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The result rests on standard measure-theoretic assumptions for Borel spaces and the martingale property in the space of probability measures; no free parameters or invented entities are indicated in the abstract.

axioms (2)
  • domain assumption Observations lie in a Borel space
    Required for the equivalence statement in the abstract.
  • domain assumption The martingale property in the space of probability measures captures the intended notion of auto-calibration for sequences
    Central to the claimed equivalence.

pith-pipeline@v0.9.1-grok · 5632 in / 1065 out tokens · 33183 ms · 2026-07-01T02:18:49.111199+00:00 · methodology

