pith. machine review for the scientific record.

arxiv: 2605.08422 · v1 · submitted 2026-05-08 · 📊 stat.ME · econ.EM · stat.CO

Recognition: 2 theorem links

· Lean Theorem

Rolling-Origin Conformal Prediction under Local Stationarity and Weak Dependence

Stanisław M. S. Halkiewicz

Pith reviewed 2026-05-12 01:03 UTC · model grok-4.3

classification 📊 stat.ME · econ.EM · stat.CO
keywords conformal prediction, time series, local stationarity, rolling origin, coverage error, α-mixing, Hölder smoothness, calibration window

The pith

Rolling-origin conformal prediction attains minimax-optimal coverage rates for time series by tuning the calibration window to local stationarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that conformal prediction intervals for time-series forecasts can be calibrated using only the m most recent pseudo-out-of-sample errors rather than the full history. Under Hölder-β local stationarity and α-mixing, this yields a four-term coverage-error decomposition whose leading terms determine the optimal window size m⋆ ≍ T^{2β/(2β+1)} and the resulting coverage error of order O(T^{-β/(2β+1)}). A Le Cam two-point argument establishes that no other calibration rule can improve this rate over the Hölder-β model class. Real-data experiments on six series and 93 M4 competition series confirm that the rolling procedure keeps empirical coverage near the nominal level while outperforming full-history calibration.

Core claim

Rolling-origin conformal prediction calibrates the conformal quantile on the m most recent pseudo-out-of-sample forecast errors. Under Hölder-β local stationarity and α-mixing, a four-term coverage-error decomposition yields the optimal calibration window m⋆ ≍ T^{2β/(2β+1)} and coverage-error rate O(T^{-β/(2β+1)}). This rate is minimax optimal, as shown by a Le Cam two-point construction. The Bahadur representation is proved under both α-mixing and physical dependence, and an oracle inequality justifies data-driven window selection via Winkler cross-validation.
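The balancing step behind the window rate can be sketched as follows. This is an illustrative heuristic, assuming the two leading terms of the decomposition are a drift bias of order (m/T)^β and a quantile-estimation term of order m^{-1/2}; the constants C_1, C_2 and the exact form of the terms are placeholders, not taken from the paper.

```latex
\begin{align*}
\text{coverage error}(m) &\lesssim C_1 \left(\frac{m}{T}\right)^{\beta} + C_2\, m^{-1/2}, \\
\left(\frac{m}{T}\right)^{\beta} \asymp m^{-1/2}
  &\iff m^{\beta + 1/2} \asymp T^{\beta}
  \iff m^{\star} \asymp T^{2\beta/(2\beta+1)}, \\
\left(m^{\star}\right)^{-1/2} &\asymp T^{-\beta/(2\beta+1)}.
\end{align*}
```

For β = 1 this gives m⋆ ≍ T^{2/3} and a coverage error of order T^{-1/3}.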

What carries the argument

The rolling-origin calibration window of length m that selects the m most recent pseudo-out-of-sample forecast errors for quantile estimation, which enables the four-term coverage-error decomposition.
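A minimal sketch of the calibration step described above, assuming absolute forecast errors as the nonconformity score and a symmetric interval around the point forecast. The function name `rolling_conformal_interval` and its arguments are hypothetical, and the finite-sample quantile index is the standard split-conformal choice rather than something confirmed by the paper.

```python
import math
import numpy as np

def rolling_conformal_interval(errors, point_forecast, m, alpha=0.1):
    """Symmetric interval calibrated on the m most recent forecast errors.

    errors: past pseudo-out-of-sample errors y_t - yhat_t, oldest first
            (hypothetical input layout, assumed for illustration).
    """
    # Keep only the m most recent absolute errors (the rolling window).
    recent = np.sort(np.abs(np.asarray(errors, dtype=float)[-m:]))
    n = recent.size
    # Standard split-conformal order statistic, capped at the largest score.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = recent[k - 1]
    return point_forecast - q, point_forecast + q

# Usage: the three oldest errors fall outside the window of length 4.
lo, hi = rolling_conformal_interval([5, 5, 5, 1, -2, 3, -4], 10.0, m=4, alpha=0.5)
print(lo, hi)
```

Full-history calibration corresponds to setting `m` to the number of available errors, which is the baseline the rolling procedure is compared against.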

If this is right

  • The coverage error shrinks at the faster rate O(T^{-β/(2β+1)}) compared with full-history calibration.
  • Winkler cross-validation provides an adaptive, oracle-efficient selector for m that does not require knowledge of β.
  • Empirical coverage remains within ±2 percent of the nominal 90 percent level at short and medium horizons on real series.
  • The rolling procedure outperforms full-history calibration in 86 percent of comparisons, with a median Winkler-score improvement of 12.3 percent.
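To make the Winkler-based window selection concrete, here is a simplified sketch. `winkler_score` implements the standard Winkler interval score (interval width plus a 2/α penalty per unit of miss); `select_window` is a hypothetical grid search that picks the candidate m with the lowest mean score. It illustrates the idea only and is not the paper's cross-validation procedure.

```python
import numpy as np

def winkler_score(y, lower, upper, alpha=0.1):
    """Winkler interval score: width plus a 2/alpha penalty per unit of miss."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    below = np.maximum(lower - y, 0.0)  # miss distance when y is under the interval
    above = np.maximum(y - upper, 0.0)  # miss distance when y is over the interval
    return (upper - lower) + (2.0 / alpha) * (below + above)

def select_window(y, forecasts, candidates, alpha=0.1):
    """Return the candidate window with the lowest mean Winkler score."""
    y = np.asarray(y, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    scores = np.abs(y - forecasts)
    start = max(candidates)  # evaluate every candidate on the same time range
    results = {}
    for m in candidates:
        losses = []
        for t in range(start, len(y)):
            # Calibrate on the m errors strictly before time t.
            q = np.quantile(scores[t - m:t], 1 - alpha)
            losses.append(winkler_score(y[t], forecasts[t] - q, forecasts[t] + q, alpha))
        results[m] = float(np.mean(losses))
    best_m = min(results, key=results.get)
    return best_m, results

rng = np.random.default_rng(0)
y = rng.standard_normal(300)
best_m, results = select_window(y, np.zeros(300), [20, 50, 100])
print(best_m, results)
```

Because the score penalises both wide intervals and misses, the selected window trades off sharpness against coverage without needing β as an input.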

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The minimax optimality result implies that no substantially different calibration strategy can beat this rate inside the same smoothness class.
  • The cross-frequency regression slope near 2/3 in the empirical study suggests that real data often behave as if governed by the Hölder-β model.
  • The same four-term decomposition technique may extend to other nonconformity scores or to multivariate forecasting problems.
  • The framework indicates that local adaptation is necessary for conformal methods to retain valid coverage under distributional drift.
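The link between the 2/3 slope and the smoothness index is simple arithmetic on the exponent 2β/(2β+1). The sketch below uses the slope and confidence interval quoted in the abstract; the helper names `window_exponent` and `implied_beta` are hypothetical, and inverting the exponent to read off β is an editorial illustration, not an estimate reported by the paper.

```python
def window_exponent(beta):
    """Theoretical growth exponent of the optimal window, 2*beta / (2*beta + 1)."""
    return 2.0 * beta / (2.0 * beta + 1.0)

def implied_beta(slope):
    """Invert slope = 2*beta / (2*beta + 1) for beta; needs 0 < slope < 1."""
    if not 0.0 < slope < 1.0:
        raise ValueError("slope must lie strictly between 0 and 1")
    return slope / (2.0 * (1.0 - slope))

# Values quoted in the abstract.
slope, ci_low, ci_high = 0.614, 0.424, 0.805

theoretical = window_exponent(1.0)  # beta = 1 gives 2/3
print("theoretical exponent inside CI:", ci_low <= theoretical <= ci_high)
print("beta implied by point estimate:", implied_beta(slope))
```

The confidence interval is wide, so the implied β is only loosely pinned down; the check shows consistency with β = 1 rather than evidence for it.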

Load-bearing premise

The time series must satisfy Hölder-β local stationarity together with α-mixing or physical dependence, without which the coverage-error decomposition and optimal-rate result do not apply.

What would settle it

Simulate a Hölder-β locally stationary α-mixing process with known β, apply rolling-origin calibration at m near T^{2β/(2β+1)}, and verify that the empirical coverage deviation decays exactly at rate T^{-β/(2β+1)} rather than faster or slower.
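A much-simplified version of this check can be sketched in a few lines. It assumes independent Gaussian noise with a linearly drifting scale (a Lipschitz drift, so β = 1, and trivially α-mixing), a zero point forecast, and absolute values as scores; `simulate_coverage` is a hypothetical helper. It compares rolling and full-history coverage at a single T and does not by itself verify the decay rate.

```python
import numpy as np

def simulate_coverage(T=4000, alpha=0.1, seed=0):
    """Empirical coverage of rolling vs full-history calibration under scale drift."""
    rng = np.random.default_rng(seed)
    u = np.arange(T) / T
    sigma = 1.0 + 2.0 * u                    # assumed Lipschitz scale drift (beta = 1)
    y = sigma * rng.standard_normal(T)
    scores = np.abs(y)                       # forecast is 0, so the score is |y_t|
    m = int(round(T ** (2.0 / 3.0)))         # m of order T^{2*beta/(2*beta+1)}, beta = 1
    hits_roll, hits_full = [], []
    for t in range(m, T):
        q_roll = np.quantile(scores[t - m:t], 1 - alpha)  # m most recent scores
        q_full = np.quantile(scores[:t], 1 - alpha)       # entire history
        hits_roll.append(scores[t] <= q_roll)
        hits_full.append(scores[t] <= q_full)
    return m, float(np.mean(hits_roll)), float(np.mean(hits_full))

m, cov_roll, cov_full = simulate_coverage()
print(m, cov_roll, cov_full)
```

Repeating the run over a grid of T values and regressing the log coverage deviation on log T would give a rough estimate of the decay exponent to compare with -β/(2β+1).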

Figures

Figures reproduced from arXiv: 2605.08422 by Stanisław M. S. Halkiewicz.

Figure 1: Empirical coverage as a function of calibration window.
Figure 2: Winkler score (lower is better) as a function of calibration window.
Figure 3: Empirical coverage at Winkler-optimal m⋆, averaged across series within each dataset. AR(p) model, h = 1. Horizontal dotted line: nominal 90% target. Numbers above bars show empirical coverage to three decimal places.
[Plot residue: panels Electricity, Financial, Macro, …; methods full, rolling, vol scaled; axis: mean local coverage.]
Figure 4: Rolling local coverage (mean ± standard deviation over 50-observation windows). AR(p) model, h = 1. Narrower error bars indicate more stable conditional coverage.
Figure 5: Left: log-log scatter of Winkler-optimal
Figure 6: Coverage (left) and mean interval half-width (right) by forecast horizon.
Original abstract

We propose and analyse rolling-origin conformal prediction for time-series forecasting. The method calibrates the conformal quantile against the $m$ most recent pseudo-out-of-sample forecast errors, adapting to serial dependence, volatility clustering, and distributional drift that invalidate classical conformal guarantees. Under Hölder-$\beta$ local stationarity and $\alpha$-mixing, we establish a four-term coverage-error decomposition and derive the optimal calibration window $m^{\star} \asymp T^{2\beta/(2\beta+1)}$ with coverage-error rate $O(T^{-\beta/(2\beta+1)})$. A Le Cam two-point construction shows this rate is minimax-optimal over the Hölder-$\beta$ model class. The Bahadur representation is proved under both $\alpha$-mixing and the physical-dependence framework of Wu (2005). An oracle inequality formalises Winkler cross-validation as an adaptive window selector; the required uniform concentration condition is established in an appendix. Validation on six real series and 93 M4 competition series confirms the theory: rolling-origin calibration outperforms full-history calibration in 86% of comparisons (median Winkler improvement 12.3%), maintains coverage within $\pm2\%$ of the 90% target at short and medium horizons, and the cross-frequency log-log regression slope $0.614$ (95% CI $[0.424, 0.805]$) is consistent with the theoretical $2/3$ after controlling for frequency fixed effects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes rolling-origin conformal prediction for time-series forecasting, calibrating the conformal quantile on the m most recent pseudo-out-of-sample forecast errors to handle serial dependence, volatility clustering, and distributional drift. Under Hölder-β local stationarity and α-mixing (or physical dependence), it establishes a four-term coverage-error decomposition and derives the optimal calibration window m⋆ ≍ T^{2β/(2β+1)}, which yields coverage-error rate O(T^{-β/(2β+1)}). A Le Cam two-point construction shows that this rate is minimax-optimal over the Hölder-β class. The manuscript also provides a Bahadur representation and an oracle inequality for Winkler cross-validation as an adaptive window selector. Empirical validation on six real series plus 93 M4 series finds that rolling-origin calibration outperforms full-history calibration in 86% of cases, with median Winkler improvement of 12.3% and coverage within ±2% of target.

Significance. If the four-term decomposition is complete and the interaction terms are controlled, the work supplies the first optimal-rate theory for adaptive conformal calibration in locally stationary weakly dependent series, together with a practical oracle inequality and strong empirical support on competition data. The dual Bahadur proofs (α-mixing and Wu physical dependence) and the explicit minimax lower bound are notable strengths that would make the result a reference point for time-series conformal methods.

major comments (3)
  1. [four-term coverage-error decomposition] The four-term coverage-error decomposition (abstract and main theoretical result): the balancing argument for m⋆ ≍ T^{2β/(2β+1)} assumes the four terms (non-stationarity bias, quantile variance, mixing covariance, Bahadur remainder) are the only contributions of order T^{-β/(2β+1)}. An interaction of order (m/T)^β · α(m) between the local-stationarity drift and the α-mixing coefficients must be shown to be o(T^{-β/(2β+1)}) under the chosen m⋆; otherwise the claimed rate is not guaranteed. The proof sketch should explicitly bound or absorb this cross term.
  2. [Le Cam two-point construction] Le Cam two-point construction (minimax lower bound): both hypotheses in the construction must lie inside the same α-mixing class (with the same mixing rate) as the upper-bound model; otherwise the lower bound applies to a strictly larger function class than the one for which the upper bound is proved, weakening the minimax statement.
  3. [appendix on uniform concentration] Uniform concentration condition for the oracle inequality (appendix): the condition is invoked to justify Winkler cross-validation as an adaptive selector, but it must be verified to hold uniformly over the rolling-origin windows of length m under the joint Hölder-β and α-mixing assumptions; a counter-example or explicit rate would clarify whether the oracle inequality is sharp.
minor comments (2)
  1. [empirical validation] The cross-frequency log-log regression reports slope 0.614 with 95% CI [0.424, 0.805] claimed to be consistent with the theoretical 2/3; the regression specification (which frequencies, how many series per frequency, fixed effects) should be stated explicitly so readers can assess the power of the consistency check.
  2. [Bahadur representation] Notation for the physical-dependence coefficients (Wu 2005) is introduced alongside α-mixing; a short comparison table or remark clarifying when one framework yields strictly stronger or weaker rates than the other would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments highlight important technical points that will improve the clarity and rigor of the theoretical results. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: The four-term coverage-error decomposition (abstract and main theoretical result): the balancing argument for m⋆ ≍ T^{2β/(2β+1)} assumes the four terms (non-stationarity bias, quantile variance, mixing covariance, Bahadur remainder) are the only contributions of order T^{-β/(2β+1)}. An interaction of order (m/T)^β · α(m) between the local-stationarity drift and the α-mixing coefficients must be shown to be o(T^{-β/(2β+1)}) under the chosen m⋆; otherwise the claimed rate is not guaranteed. The proof sketch should explicitly bound or absorb this cross term.

    Authors: We agree that an explicit bound on the cross term between the Hölder drift and the mixing coefficients is necessary to confirm that it does not affect the claimed rate. In the current proof of Theorem 1 we bound the four main terms separately and invoke the α-mixing decay to control dependence, but we did not isolate this particular product. Under the maintained assumption that α(k) decays at least polynomially with exponent greater than 1, the product (m/T)^β α(m) with m ≍ T^{2β/(2β+1)} is of strictly smaller order than T^{-β/(2β+1)}. We will insert a short lemma that isolates and bounds this interaction term, thereby completing the decomposition argument. revision: yes

  2. Referee: Le Cam two-point construction (minimax lower bound): both hypotheses in the construction must lie inside the same α-mixing class (with the same mixing rate) as the upper-bound model; otherwise the lower bound applies to a strictly larger function class than the one for which the upper bound is proved, weakening the minimax statement.

    Authors: Both hypotheses in the Le Cam construction are generated from the same base process that satisfies the α-mixing condition with identical decay rate; the local perturbation used to create the two points is supported on a vanishing fraction of the sample and does not change the mixing coefficients. We will add an explicit sentence in the proof of the lower bound (Section 4) stating that the mixing rate is held fixed across the two hypotheses, thereby ensuring the lower bound applies to exactly the same function class used for the upper bound. revision: yes

  3. Referee: Uniform concentration condition for the oracle inequality (appendix): the condition is invoked to justify Winkler cross-validation as an adaptive selector, but it must be verified to hold uniformly over the rolling-origin windows of length m under the joint Hölder-β and α-mixing assumptions; a counter-example or explicit rate would clarify whether the oracle inequality is sharp.

    Authors: Appendix C already derives the uniform concentration under the joint Hölder-β and α-mixing assumptions, but the uniformity is stated with respect to a fixed window length. To make the argument fully rigorous for the rolling-origin setting we will add an explicit rate (of order (log m / m)^{1/2} plus a mixing remainder) that holds uniformly over all windows of length m. This rate is sufficient for the oracle inequality to remain sharp; no counter-example arises under the maintained conditions. revision: partial
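The order calculation in the authors' response to the first comment can be checked directly. The sketch below assumes a polynomial mixing bound α(k) ≤ C k^{-a} with a > 1; the symbols C and a are hypothetical names for the constant and exponent, which the rebuttal does not specify.

```latex
\begin{align*}
m \asymp T^{2\beta/(2\beta+1)}
  &\implies \frac{m}{T} \asymp T^{-1/(2\beta+1)}
  \implies \left(\frac{m}{T}\right)^{\beta} \asymp T^{-\beta/(2\beta+1)}, \\
\alpha(m) \le C\, m^{-a},\ a > 1
  &\implies \left(\frac{m}{T}\right)^{\beta} \alpha(m)
  \lesssim T^{-\beta/(2\beta+1)} \cdot T^{-2a\beta/(2\beta+1)}
  = o\!\left(T^{-\beta/(2\beta+1)}\right).
\end{align*}
```

The first factor already matches the target rate, so the cross term is negligible as soon as α(m) → 0; the polynomial decay only quantifies how much smaller it is.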

Circularity Check

0 steps flagged

No significant circularity: theoretical rates derived from external assumptions

Full rationale

The paper establishes a four-term coverage-error decomposition under the stated Hölder-β local stationarity and α-mixing conditions, then balances the resulting terms to obtain the optimal m⋆ ≍ T^{2β/(2β+1)} and rate O(T^{-β/(2β+1)}). The Le Cam minimax construction is performed directly over the same model class. The Bahadur representation invokes the external Wu (2005) physical-dependence framework, and the oracle inequality for Winkler cross-validation is proved in the appendix under uniform concentration. No claimed result reduces by construction to a fitted parameter, self-citation, or renamed input; the empirical comparisons on real series are presented separately as validation. The derivation chain is therefore self-contained against the external mixing and stationarity benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions that are not derived inside the paper: Hölder-β local stationarity and α-mixing (or physical dependence). No free parameters are fitted to produce the rate; the window length m⋆ is obtained by balancing bias and variance terms in the coverage-error decomposition. No new entities are postulated.

axioms (2)
  • domain assumption: Hölder-β local stationarity
    Invoked to control the bias term in the four-term coverage-error decomposition and to obtain the optimal rate.
  • domain assumption: α-mixing (or physical dependence of Wu 2005)
    Required for the Bahadur representation and to bound the stochastic terms in the coverage error.

pith-pipeline@v0.9.0 · 5578 in / 1587 out tokens · 61688 ms · 2026-05-12T01:03:45.986927+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1] Fitting time series models to nonstationary processes. Annals of Statistics 25: 1–37. Doukhan P. 1994. Mixing: Properties and Examples. Lecture Notes in Statistics

  2. [2] A new proof of the Bahadur representation of quantiles and an application. Annals of Mathematical Statistics 42: 1957–1961. Gibbs I, Candès E

  3. [3] Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems 34: 1660–1672. Györfi L, Kohler M, Krzyżak A, Walk H. 2002. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer: New York. Kiefer J

  4. [4] The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36: 54–74. Massart P. 2007. Concentration Inequalities and Model Selection. Lecture Notes in Mathematics

  5. [5] Covariance inequalities for strongly mixing processes. Annales de l’Institut Henri Poincaré 29: 587–597. Rio E. 2017. Asymptotic Theory of Weakly Dependent Random Processes. Probability Theory and Stochastic Modelling

  6. [6] Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting 16: 437–450. Tsybakov AB. 2009. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer: New York. Vogt M

  7. [7] Nonparametric regression for locally stationary time series. Annals of Statistics 40: 2601–2633. Vovk V, Gammerman A, Shafer G. 2005. Algorithmic Learning in a Random World. Springer: New York. Wendler M

  8. [8] A decision theoretic approach to interval estimation. Journal of the American Statistical Association 67: 187–191. Wu WB. 2005a. On the Bahadur representation of sample quantiles for dependent sequences. Annals of Statistics 33: 1934–1957. Wu WB. 2005b. Nonlinear system theory: another look at dependence. Proceedings of the National Academy of Sciences 102: 14...

  9. [9] for the variance bound and β > 2 for the polynomial tail in the large-deviation argument; both hold for ARMA, GARCH, and tvARCH processes (Vogt, 2012; Fryzlewicz & Subba Rao, 2011). For heavy-tailed oracle scores where |Y_j| ≤ 1 fails, Rio’s Fuk–Nagaev inequality (Rio,

  10. [10] extends the argument under the DMR condition ∫_0^1 α^{-1}(u) Q^2(u) du < ∞. Under the physical-dependence stability condition of Wu (2005a), the near-stationarity restriction m = o(T) can be relaxed using Zhou & Wu (2009), Theorem

  11. [11] Substituting, P( sup_{m∈M_T} |W_T(m) − R_T(m)| > x_T | F_cal ) ≤ 2 K_T^{1−C_1 A^2}. Choosing A so that C_1 A^2 > 2 gives K_T^{1−C_1 A^2} → 0, establishing (14). Step 4: Rate relative to R⋆_T. By Assumption 6, n_val ≥ cT, so √((log K_T)/n_val) ≤ C √((log K_T)/T). By Assumption 7, log K_T = o(T^{1/(2β+1)}), equivalently √((log K_T)/T) = o(T^{-β/(2β+1)}), giving (15). Remark 12 (Sharpness). The ra...

  12. [12] At β = 1 the moment condition p > 6 is satisfied by stationary GARCH(1,1) models under standard parameter restrictions (Carrasco & Chen, 2002).