pith. machine review for the scientific record.

arxiv: 2605.03997 · v1 · submitted 2026-05-05 · 📊 stat.ME · econ.EM · physics.ao-ph

Recognition: unknown

Uncertainty Quantification in Forecast Comparisons

Marc-Oliver Pohle, Sebastian Lerch, Tanja Zahn

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 02:35 UTC · model grok-4.3

classification 📊 stat.ME · econ.EM · physics.ao-ph
keywords forecast evaluation · skill scores · simultaneous confidence bands · Diebold-Mariano test · bootstrap · uncertainty quantification · multivariate forecasting · proper scoring rules

The pith

Simultaneous confidence bands provide valid joint inference on expected scores and skill scores across multiple forecast dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops simultaneous confidence bands for expected scores and skill scores to handle uncertainty quantification when comparing forecasts in settings with many variables, horizons, locations, or methods. Standard pairwise tests or pointwise intervals ignore the multiple-comparison problem and produce invalid joint statements. The bands are constructed via a bootstrap that remains valid under multivariate extensions of the classical Diebold-Mariano assumptions on forecast-error dependence. The approach applies to any forecast type, from point to full distributional predictions, and is illustrated on macroeconomic and weather-forecasting examples.
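To fix ideas, a skill score follows the generic recipe SS = 1 - (mean score of forecast) / (mean score of benchmark), so positive values mean the forecast improves on the benchmark. A toy computation under squared error, with made-up data rather than anything from the paper:

```python
import numpy as np

# Toy skill score under squared error. Data are simulated here purely for
# illustration; none of this reproduces the paper's applications.
rng = np.random.default_rng(0)
y = rng.normal(size=500)        # realizations with true mean 0, variance 1
f_model = np.zeros(500)         # forecast located at the true mean
f_bench = np.ones(500)          # biased benchmark forecast

s_model = (y - f_model) ** 2    # per-period scores (squared error)
s_bench = (y - f_bench) ** 2

# SS = 1 - mean score of forecast / mean score of benchmark;
# in population this is 1 - 1/2 = 0.5 for the setup above.
skill = 1.0 - s_model.mean() / s_bench.mean()
print(round(skill, 3))
```

The estimate `skill` is exactly the quantity whose sampling uncertainty the paper argues is rarely quantified, and which the bands below are built for.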

Core claim

We introduce simultaneous confidence bands for expected scores and skill scores that deliver valid joint inference under multivariate extensions of the classical Diebold-Mariano assumptions. The bands are implemented by a bootstrap procedure and apply to any consistent scoring function or proper scoring rule, thereby furnishing a coherent framework for uncertainty quantification in multi-dimensional forecast evaluation problems.

What carries the argument

Simultaneous confidence bands constructed by bootstrap under multivariate Diebold-Mariano dependence assumptions.
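The recipe can be sketched in a few lines: resample the vector of score differentials with a moving-block bootstrap, studentize each dimension, and take one sup-t critical value so a single quantile covers all P dimensions jointly. Everything below (the function name, block length, studentization by the bootstrap standard error) is an illustrative reconstruction under those assumptions, not the authors' implementation:

```python
import numpy as np

def supt_band(loss_diff, block_len=4, n_boot=2000, level=0.90, seed=0):
    """Sup-t simultaneous confidence band for mean score differentials.

    loss_diff: (N, P) array, one column per evaluation dimension
    (horizon, variable, location). Hypothetical sketch of the general
    block-bootstrap + sup-t recipe; not the paper's code.
    """
    rng = np.random.default_rng(seed)
    N, P = loss_diff.shape
    mean = loss_diff.mean(axis=0)
    n_blocks = -(-N // block_len)            # ceil(N / block_len)
    boot_means = np.empty((n_boot, P))
    for b in range(n_boot):
        # moving-block resample: glue random contiguous blocks to length N
        starts = rng.integers(0, N - block_len + 1, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)).ravel()[:N]
        boot_means[b] = loss_diff[idx].mean(axis=0)
    se = boot_means.std(axis=0, ddof=1)      # bootstrap standard errors
    # sup-t: max absolute studentized deviation across all P dimensions
    tmax = (np.abs(boot_means - mean) / se).max(axis=1)
    c = np.quantile(tmax, level)             # one critical value for the band
    return mean - c * se, mean + c * se
```

Because the critical value is the quantile of a maximum, it exceeds the pointwise quantile, which is exactly what buys joint rather than pointwise coverage.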

If this is right

  • Joint statements about relative forecast performance across many variables or horizons become statistically valid rather than over-confident.
  • Any proper scoring rule can be equipped with simultaneous bands, so the same procedure works for mean, quantile, and distributional forecasts.
  • Time-varying parameter models can be compared with constant-parameter benchmarks while controlling error rates across multiple macroeconomic series.
  • Physics-based and data-driven weather models can be ranked with joint uncertainty statements across lead times and spatial locations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The bands could be used to construct formal tests for the existence of any forecast that dominates a given benchmark across an entire evaluation grid.
  • Extension to non-stationary or strongly dependent errors would require new theory but would greatly enlarge the set of practical applications.
  • Because the method is agnostic to the scoring rule, it supplies a uniform way to report uncertainty for both economic and statistical loss functions.

Load-bearing premise

Forecast errors obey the multivariate stationarity and weak dependence conditions that extend the classical Diebold-Mariano assumptions.

What would settle it

Empirical coverage of the bands falls materially below the nominal level in Monte Carlo experiments that deliberately violate the multivariate Diebold-Mariano dependence conditions while keeping all other aspects of the evaluation fixed.
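A full settling experiment would inject dependence that violates the assumptions; but even under iid differentials a toy Monte Carlo shows the failure mode the paper targets: joint coverage of naive pointwise 90% intervals collapses as the number of dimensions grows, while a Bonferroni correction holds the level. The constants and setup below are illustrative, not the paper's simulation design:

```python
import numpy as np

# Joint coverage of pointwise vs Bonferroni 90% intervals for P mean score
# differentials under the null (true mean 0, iid Gaussian). Illustrative only.
rng = np.random.default_rng(42)
N, P, reps = 200, 10, 2000
z_pw = 1.6449    # two-sided 90% normal quantile (per dimension)
z_bf = 2.5758    # Bonferroni: per-dimension level 1 - 0.10/P, i.e. 99% two-sided
cover_pw = cover_bf = 0
for _ in range(reps):
    d = rng.normal(size=(N, P))              # iid score differentials
    m = d.mean(axis=0)
    se = d.std(axis=0, ddof=1) / np.sqrt(N)
    cover_pw += np.all(np.abs(m) <= z_pw * se)   # all P intervals cover 0?
    cover_bf += np.all(np.abs(m) <= z_bf * se)
print(cover_pw / reps, cover_bf / reps)
```

With P = 10 the pointwise joint coverage lands near 0.9**10 ≈ 0.35 while Bonferroni stays close to 0.90, mirroring the qualitative pattern in the paper's tabulated simulations.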

Figures

Figures reproduced from arXiv: 2605.03997 by Marc-Oliver Pohle, Sebastian Lerch, Tanja Zahn.

Figure 1. Estimated skill scores of a BVAR with time-varying parameters and stochastic …
Figure 2. Estimated skill scores of the BVAR with time-varying parameters and stochastic …
Figure 3. Estimated skill scores of the BVAR with time-varying parameters and stochastic …
Figure 4. Estimated skill scores of the BVAR with time-varying parameters and stochastic …
Figure 5. Average CRPS (left) and the estimated CRPSS (right) with 90% Bonferroni …
Figure 6. Width of simultaneous sup-t confidence bands relative to pointwise bands (left) …
Figure 7. Width of simultaneous Bonferroni confidence bands relative to sup-t bands (left) …
Figure 8. Estimated skill scores of the BVAR with time-varying parameters and stochastic …
Figure 9. Estimated skill scores of the BVAR with time-varying parameters and stochastic …
Figure 10. Estimated skill scores of the BVAR with time-varying parameters and stochastic …
Figure 11. Estimated skill scores of probabilistic forecasts generated from the data-driven …
Figure 12. Estimated CRPS (left) and CRPSS (right) with 90% sup-t confidence bands …
Figure 13. Estimated skill scores of probabilistic forecasts generated from the data-driven …

Original abstract

Skill scores, which measure the relative improvement of a forecasting method over a benchmark via consistent scoring functions and proper scoring rules, are a standard tool in forecast evaluation, yet their sampling uncertainty is rarely rigorously quantified. With modern forecasting applications being increasingly multivariate and involving evaluations across multiple horizons, variables, spatial locations, and forecasting methods, standard tools like the pairwise Diebold-Mariano forecast accuracy test or pointwise confidence intervals fail to account for the multiple comparison problem, leading to inflated Type I error rates and invalid joint inference. To address the lack of a coherent, statistically rigorous framework for quantifying uncertainty across these multi-dimensional evaluation problems, we introduce simultaneous confidence bands for expected scores and skill scores. Our framework provides a versatile tool for joint inference that is applicable to any forecast type from mean and quantile to full distributional forecasts. We develop a bootstrap implementation and show that our bands are valid under multivariate extensions of the classical Diebold-Mariano assumptions. We demonstrate the practical utility of the approach in two case studies by quantifying the benefits of time-varying parameter models for macroeconomic forecasting, and by comparing data-driven and physics-based models in probabilistic weather forecasting.
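For contrast with the simultaneous bands, the abstract's pairwise baseline, the Diebold-Mariano test, amounts to a t-statistic on the loss differential with a HAC variance. A minimal sketch with Bartlett weights (an illustrative reconstruction, not the paper's code):

```python
import numpy as np

def dm_stat(d, max_lag=4):
    """Pairwise Diebold-Mariano statistic for a loss-differential series d_t.

    Mean differential divided by a HAC standard error with Bartlett weights;
    asymptotically standard normal under equal predictive accuracy.
    Illustrative sketch only.
    """
    d = np.asarray(d, dtype=float)
    N = d.size
    u = d - d.mean()
    lrv = (u @ u) / N                     # lag-0 autocovariance
    for k in range(1, max_lag + 1):
        w = 1.0 - k / (max_lag + 1)       # Bartlett kernel weight (ensures lrv >= 0)
        lrv += 2.0 * w * (u[k:] @ u[:-k]) / N
    return d.mean() / np.sqrt(lrv / N)
```

Run separately on each of P differentials, P such tests give no control of the joint error rate, which is precisely the gap the simultaneous bands close.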

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces simultaneous confidence bands for expected scores and skill scores in multivariate forecast evaluation settings that involve multiple horizons, variables, locations, and methods. It develops a bootstrap procedure claimed to deliver valid joint inference under multivariate extensions of the classical Diebold-Mariano assumptions (stationarity and weak dependence of the vector-valued loss differentials) and illustrates the bands in two empirical case studies: time-varying-parameter macro forecasts and probabilistic weather forecasts.

Significance. If the bootstrap coverage result holds, the work supplies a practical and statistically coherent tool for joint inference that directly addresses the multiple-comparison problem in modern, high-dimensional forecast comparisons. The framework is scoring-rule agnostic and therefore applies to point, quantile, and distributional forecasts alike.

major comments (2)
  1. §3 (bootstrap consistency): the stated multivariate DM mixing conditions are necessary for the bootstrap to achieve asymptotic coverage, yet the manuscript provides neither explicit rate conditions nor Monte Carlo evidence that coverage is attained when cross-horizon or cross-variable dependence is strong or when volatility clustering is present. Without such verification the central validity claim remains untested.
  2. §4–5 (empirical applications): the reported bands are constructed from the same data used to select model specifications and tuning parameters; no adjustment for this data-dependent choice is described, which can invalidate the nominal coverage even under the maintained assumptions.
minor comments (2)
  1. Notation for the multivariate loss differential process is introduced without an explicit dimension index; adding a clear subscript (e.g., d = 1,…,D) would improve readability.
  2. Figure captions should state the exact bootstrap method (multiplier vs. block) and the number of replications used.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on bootstrap validity and post-selection issues. We address each point below and will revise the manuscript accordingly, adding Monte Carlo experiments under stronger dependence and an explicit discussion of data-dependent tuning in the applications.

Point-by-point responses
  1. Referee: §3 (bootstrap consistency): the stated multivariate DM mixing conditions are necessary for the bootstrap to achieve asymptotic coverage, yet the manuscript provides neither explicit rate conditions nor Monte Carlo evidence that coverage is attained when cross-horizon or cross-variable dependence is strong or when volatility clustering is present. Without such verification the central validity claim remains untested.

    Authors: The consistency theorem in §3 is stated under the maintained multivariate strong-mixing conditions that extend the classical DM assumptions; these already permit a range of weak dependence, including limited volatility clustering. We did not supply explicit convergence rates because the proof follows standard arguments for the multivariate bootstrap under mixing (e.g., via blocking and coupling). We agree, however, that finite-sample coverage under stronger dependence merits direct verification. In the revision we will add a Monte Carlo section that examines coverage for (i) strong cross-horizon and cross-variable correlation and (ii) GARCH-type volatility clustering, using the same block-bootstrap implementation employed in the paper. revision: yes

  2. Referee: §4–5 (empirical applications): the reported bands are constructed from the same data used to select model specifications and tuning parameters; no adjustment for this data-dependent choice is described, which can invalidate the nominal coverage even under the maintained assumptions.

    Authors: We acknowledge that the reported bands do not incorporate an adjustment for the data-dependent selection of model specifications and tuning parameters. In both applications the choices were guided by prior literature and separate validation exercises, yet this does not formally restore exact coverage. In the revision we will add a dedicated paragraph in each empirical section that (a) states the selection steps explicitly, (b) notes the potential coverage distortion, and (c) reports sensitivity checks that fix the specifications on an earlier subsample before constructing the bands on the evaluation period. revision: partial

Circularity Check

0 steps flagged

No circularity: bootstrap validity derived from independent asymptotic assumptions

Full rationale

The paper constructs simultaneous confidence bands via a standard multiplier or block bootstrap applied to the vector of loss differentials. Validity is established by showing that the bootstrap consistently approximates the limiting Gaussian process that arises under the stated multivariate extension of Diebold-Mariano conditions (stationarity plus weak dependence sufficient for a functional CLT). These assumptions are external to the procedure itself and are not recovered from the fitted bands; the coverage statement is therefore a genuine theorem rather than a definitional identity or a fitted-input prediction. No self-citation supplies a uniqueness result or an ansatz that the present derivation relies upon. Consequently the derivation chain contains no reduction of the claimed result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on (1) the existence of a proper scoring rule whose expectation defines the skill score, (2) the multivariate extension of the Diebold-Mariano mixing and moment conditions, and (3) bootstrap consistency under those conditions. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Multivariate extension of classical Diebold-Mariano assumptions (stationarity, mixing, finite moments of score differentials)
    Invoked to guarantee bootstrap validity for the simultaneous bands.

pith-pipeline@v0.9.0 · 5505 in / 986 out tokens · 17684 ms · 2026-05-07T02:35:08.069970+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

4 extracted references · 1 canonical work page

  1. [1]

    Under Assumption 6 and Ω_pp > 0, p = 1, …, P, Assumptions 2 and 3 hold

  2. [2]

    Then the moving block bootstrap fulfils Assumption 5

    Let Assumption 6 hold and assume that Ω is nonsingular and that the block length l fulfils the condition l/N + 1/l → 0 as N → ∞. Then the moving block bootstrap fulfils Assumption 5. Proof: see Appendix D. Appendix C (Asymptotic Width and Coverage Comparisons): A natural concern is whether the bands become uninformatively wide as the dimension J of the skill score vector grow...

  3. [3]

    White (2014, Theorem 6.20) directly ensures Assumption 3

    The univariate central limit theorem in White (2014, Theorem 5.20) together with the Cramér-Wold device (noting that linear combinations of α-mixing sequences are α-mixing of the same rate by White (2014, Theorem 3.49)) yields the multivariate central limit theorem for the scores, that is, Assumption 2. White (2014, Theorem 6.20) directly ensures Assumption 3

  4. [4]

    This follows directly from Lahiri (2003, Theorem 3.2).

    Appendix E, Tabulated Simulation Results (excerpt):

    a  v  boot   type      N=100: P=2 / P=5 / P=25    N=400: P=2 / P=5 / P=25
    0  0  block  Sup-t     0.872 / 0.832 / 0.756      0.904 / 0.879 / 0.848
    0  0  block  Bonf.     0.865 / 0.845 / 0.809      0.904 / 0.893 / 0.886
    0  0  block  Pointw.   0.865 / 0.613 / 0.158      0.904 / 0.700 / 0.247
    0  0  iid    Sup-t     0.894 / 0.877 / 0.860      0.915 / 0.901 / 0.898
    0  0  iid    Bonf.     0...
    This follows directly from Lahiri (2003, Theorem 3.2). 32 E Tabulated Simulation Results N= 100N= 400 a v boot typeP= 2P= 5P= 25P= 2P= 5P= 25 0 0 block Sup-t 0.872 0.832 0.7560.9040.879 0.848 0 0 block Bonf. 0.865 0.845 0.809 0.904 0.893 0.886 0 0 block Pointw. 0.865 0.613 0.158 0.904 0.7 0.247 0 0 iid Sup-t0.8940.877 0.86 0.9150.901 0.898 0 0 iid Bonf. 0...