pith. machine review for the scientific record.

arXiv: 2604.12082 · v2 · submitted 2026-04-13 · q-fin.TR · cs.CE · econ.GN · q-fin.EC

Recognition: unknown

When Forecast Accuracy Fails: Rank Correlation and Decision Quality in Multi-Market Battery Storage Optimization

Pith reviewed 2026-05-10 14:47 UTC · model grok-4.3

classification q-fin.TR · cs.CE · econ.GN · q-fin.EC
keywords battery energy storage · price forecasting · Kendall tau · rank correlation · electricity markets · dispatch optimization · forecast evaluation · multi-market trading

The pith

Rank correlation in price forecasts, not mean absolute error, is the main driver of battery storage dispatch revenue in multi-market trading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Battery energy storage systems trade simultaneously across frequency reserve, day-ahead, and intraday markets, using price forecasts to set charge and discharge schedules. The paper tests the standard assumption that lower forecast error produces better trading outcomes. It finds instead that rank correlation, measured by Kendall tau, is the primary predictor of the fraction of perfect-foresight revenue captured on intraday markets. Forecasts whose Kendall tau exceeds an empirical threshold of roughly 0.85 to 0.95 recover up to 97 to 100 percent of perfect-foresight intraday value, while persistence forecasts with near-zero tau recover only about one-third (33 percent). The paper reports that this threshold is stable across market regimes and volatility levels. Separately, under reserve market constraints, FCR capacity revenue exceeds XBID intraday revenue by 6.5x per MW.

Core claim

Using a hierarchical three-layer optimization model on real 2020-2025 data from German and Swiss markets, the study shows that Kendall tau above approximately 0.85-0.95 allows forecasts to capture 97-100 percent of perfect-foresight intraday dispatch revenue, whereas persistence forecasts with near-zero tau capture only 33 percent. FCR capacity revenue exceeds XBID revenue by 6.5 times per MW, and Swiss hydrological surplus anomalies are significantly linked to balancing market revenue.
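
To make "fraction of perfect-foresight revenue" concrete, here is a minimal sketch under loud assumptions. A toy dispatch rule (charge in the k hours the forecast ranks cheapest, discharge in the k it ranks most expensive) stands in for the paper's three-layer optimization, which this page does not reproduce. The function names, the price vectors, and k are illustrative and not taken from the paper.

```python
def toy_dispatch_revenue(forecast, actual, k=2):
    """Revenue from charging in the k hours the forecast ranks cheapest and
    discharging in the k hours it ranks most expensive, settled at actual prices.
    Toy rule (an assumption): ignores state of charge, efficiency, and power limits."""
    hours = sorted(range(len(forecast)), key=lambda h: forecast[h])
    buy, sell = hours[:k], hours[-k:]
    return sum(actual[h] for h in sell) - sum(actual[h] for h in buy)


def value_capture_ratio(forecast, actual, k=2):
    """Revenue achieved with the forecast divided by perfect-foresight revenue."""
    perfect = toy_dispatch_revenue(actual, actual, k)
    return toy_dispatch_revenue(forecast, actual, k) / perfect


# Hypothetical hourly prices, invented for illustration only.
actual = [40.0, 35.0, 30.0, 55.0, 80.0, 95.0, 70.0, 50.0]
biased = [0.5 * p + 100.0 for p in actual]  # wrong levels, same ordering
scrambled = [60.0, 20.0, 90.0, 30.0, 50.0, 40.0, 80.0, 70.0]  # ordering unrelated to actual

vcr_biased = value_capture_ratio(biased, actual)
vcr_scrambled = value_capture_ratio(scrambled, actual)
print("biased forecast VCR:", vcr_biased)
print("scrambled forecast VCR:", round(vcr_scrambled, 3))
```

In this toy, the forecast with badly wrong levels but the right ordering captures all of the perfect-foresight revenue, while the scrambled one captures under a tenth. A dispatch model with state-of-charge, efficiency, and reserve constraints may behave differently.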

What carries the argument

Kendall tau rank correlation applied to price forecast series, which captures the ordinal ordering required by the battery dispatch optimization rather than the magnitude of price errors.
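
The contrast between ordering and magnitude can be sketched in a few lines. The example below computes Kendall tau (the tau-a variant, by plain pair counting, assuming no tied prices) and MAE for two hypothetical forecasts. The price vectors are invented for illustration, and the paper may use a different tau variant.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)


def mae(x, y):
    """Mean absolute error between two equally long series."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical hourly prices, invented for illustration only.
actual = [40.0, 35.0, 30.0, 55.0, 80.0, 95.0, 70.0, 50.0]
offset = [p + 30.0 for p in actual]  # every level off by 30, order intact
near = [36.0, 39.0, 34.0, 51.0, 84.0, 91.0, 74.0, 54.0]  # every level off by 4, two pairs swapped

for name, forecast in (("offset", offset), ("near", near)):
    print(name, "MAE =", mae(actual, forecast), "tau =", round(kendall_tau(actual, forecast), 3))
```

The offset forecast has the worse MAE (30 against 4) but a perfect tau of 1.0, while the near forecast has the better MAE but a tau of about 0.857 because two price pairs are ordered the wrong way round.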

Load-bearing premise

The three-layer hierarchical optimization system accurately represents real dispatch decisions and revenues without significant unmodeled constraints, costs, or operational limits that would change the link between forecast rank correlation and achieved value.

What would settle it

Real battery operation data in which a forecast with Kendall tau above 0.9 captures substantially less than 90 percent of perfect-foresight revenue, or a near-zero-tau forecast achieves comparable dispatch performance.

Figures

Figures reproduced from arXiv: 2604.12082 by Alessandro Falezza.

Figure 1: Empirical relationship between Kendall τ and Value Capture Ratio (VCR), estimated from 24,000 DP simulations with synthetic forecasts of controlled rank correlation. The tau-sufficiency region [0.85, 0.95] (shaded) marks the onset of VCR saturation near 100%. Points indicate the τ_K and VCR values of the five benchmark forecasts.
Figure 2: Value Capture Ratio (VCR) for each system configuration in the ablation study.
Figure 3: Relationship between Swiss hydrological anomaly and SRL
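
The Figure 1 caption mentions synthetic forecasts of controlled rank correlation, but this page does not say how they were generated. One plausible approach, offered purely as an assumption, is to add noise of increasing size to the realized prices and measure the resulting Kendall tau. The names `synthetic_forecast`, `mean_tau`, and `noise_weight` are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(x, y):
    # Kendall tau-a by pair counting; assumes no tied values.
    score = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    return score / (len(x) * (len(x) - 1) / 2)


def synthetic_forecast(actual, noise_weight, rng):
    # Add Gaussian noise scaled to the price spread; more noise, weaker rank agreement.
    spread = max(actual) - min(actual)
    return [p + noise_weight * rng.gauss(0.0, spread) for p in actual]


def mean_tau(actual, noise_weight, rng, draws=200):
    # Average tau over repeated noisy forecasts at one noise level.
    taus = [
        kendall_tau(actual, synthetic_forecast(actual, noise_weight, rng))
        for _ in range(draws)
    ]
    return sum(taus) / draws


rng = random.Random(0)
actual = [rng.uniform(20.0, 120.0) for _ in range(24)]  # one synthetic day of prices
for w in (0.0, 0.1, 0.5, 2.0):
    print(f"noise_weight={w}: mean tau = {mean_tau(actual, w, rng):.3f}")
```

Sweeping the noise weight gives a family of forecasts whose average tau falls from 1 toward 0, which is the kind of input a tau-versus-VCR curve needs. Hitting an exact target tau would require a calibration step not shown here.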
Original abstract

Battery energy storage systems (BESS) participating in multi-market electricity trading require price forecasts to optimize dispatch decisions. A widely held assumption is that forecast accuracy, measured by standard metrics such as mean absolute error (MAE), drives trading performance. We challenge this assumption using a hierarchical three-layer optimization system trading simultaneously on frequency containment reserve (FCR), automatic frequency restoration reserve (aFRR), day-ahead, and continuous intraday (XBID) markets in Germany and Switzerland over 2020-2025, with real market data from Regelleistung.net and Swissgrid. We find that rank correlation (Kendall tau), rather than MAE, is the primary predictor of intraday dispatch value: forecasts above an empirical threshold of tau approximately 0.85-0.95 capture up to 97-100% of perfect-foresight revenue, while persistence forecasts with near-zero tau capture only 33%. This threshold is stable across market regimes and volatility levels, and reflects the ordinal structure of the dispatch problem. Furthermore, under reserve market constraints, FCR capacity revenue exceeds XBID by 6.5x per MW, making capacity allocation -- not forecast accuracy -- the primary driver of total revenue. In the Swiss market, hydrological surplus anomalies are significantly associated with balancing market revenue (p = 0.0005), a mechanism absent from existing German-focused literature. These findings reframe forecast evaluation for BESS operators: the relevant question is not what the MAE is, but whether the forecast achieves tau-sufficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that in multi-market BESS optimization across FCR, aFRR, day-ahead, and XBID markets using 2020-2025 real data from Germany and Switzerland, Kendall tau rank correlation outperforms MAE as a predictor of intraday dispatch value. Forecasts with tau above an empirical 0.85-0.95 threshold recover 97-100% of perfect-foresight revenue while near-zero-tau persistence forecasts recover only 33%; the threshold is stable across regimes. FCR capacity revenue dominates XBID by 6.5x per MW, and Swiss hydrological anomalies correlate with balancing revenue (p=0.0005). The work reframes forecast evaluation around 'tau-sufficiency' rather than accuracy.

Significance. If the central result holds, the paper would meaningfully shift BESS forecast evaluation and dispatch practice away from MAE toward rank-order metrics, with direct revenue implications for operators. The five-year real-market dataset and hierarchical optimization provide concrete empirical grounding, and the FCR dominance plus Swiss hydro finding add novel cross-market insights absent from much German-centric literature. These elements could influence both trading systems and forecast product design if the optimization faithfully captures real constraints.

major comments (3)
  1. [Methods / optimization model description] The hierarchical three-layer optimization is described at a high level in the methods without explicit objective functions or constraint formulations showing how absolute price levels versus rank order enter capacity allocation, aFRR/FCR activation, and XBID bidding. This is load-bearing for the claim that revenue depends primarily on tau rather than MAE, as any absolute-level dependence (e.g., via minimum bid sizes or efficiency losses) would undermine the reported dominance of rank correlation.
  2. [Results / revenue recovery analysis] The empirical tau threshold (0.85-0.95) and associated 97-100% revenue recovery figures are presented without bootstrap intervals, cross-validation across volatility regimes, or explicit sensitivity analysis perturbing forecast magnitudes while holding rank order fixed. This weakens the stability claim and the assertion that the result is not an artifact of the specific model parameterization.
  3. [Data and experimental setup] The paper reports real 2020-2025 data from Regelleistung.net and Swissgrid but provides no details on data exclusion rules, handling of missing periods, or statistical controls for regime shifts. Without these, it is difficult to assess whether the tau-MAE comparison and hydro anomaly result (p=0.0005) are robust to post-hoc selection.
minor comments (2)
  1. [Abstract] The abstract states the tau threshold range but does not specify how it was determined (e.g., via grid search or optimization) or whether it varies by market.
  2. [Introduction / notation] Notation for the three-layer system and market abbreviations (FCR, aFRR, XBID) should be defined on first use with a brief table for readers unfamiliar with European reserve markets.
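
Major comment 2 asks for two checks that are easy to sketch: a sensitivity test that perturbs forecast magnitudes while holding rank order fixed, and bootstrap intervals on revenue recovery. The example below is a minimal illustration on synthetic data, assuming a toy dispatch rule (buy in the k cheapest forecast hours, sell in the k most expensive) rather than the paper's optimization. All names, the noise level, and the choice of transform are hypothetical.

```python
import math
import random


def toy_revenue(forecast, actual, k=4):
    # Toy rule (an assumption, not the paper's model): buy in the k hours the
    # forecast ranks cheapest, sell in the k it ranks highest, settle at actual prices.
    hours = sorted(range(len(forecast)), key=lambda h: forecast[h])
    return sum(actual[h] for h in hours[-k:]) - sum(actual[h] for h in hours[:k])


def vcr(forecast, actual, k=4):
    # Value capture ratio: achieved revenue over perfect-foresight revenue.
    return toy_revenue(forecast, actual, k) / toy_revenue(actual, actual, k)


def bootstrap_ci(values, rng, draws=2000, level=0.95):
    # Percentile bootstrap interval for the mean of `values`.
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(draws)
    )
    lower = means[int((1 - level) / 2 * draws)]
    upper = means[int((1 + level) / 2 * draws) - 1]
    return lower, upper


rng = random.Random(1)
days = []
for _ in range(60):  # 60 synthetic days of 24 hourly prices
    actual = [rng.uniform(20.0, 120.0) for _ in range(24)]
    forecast = [p + rng.gauss(0.0, 15.0) for p in actual]
    days.append((forecast, actual))

base = [vcr(f, a) for f, a in days]

# Strictly increasing transform: every magnitude changes, every rank is kept.
warped = [vcr([math.exp(p / 50.0) for p in f], a) for f, a in days]

lo, hi = bootstrap_ci(base, rng)
print("rank-preserving warp changes VCR:", base != warped)
print("mean VCR 95% bootstrap interval:", round(lo, 3), round(hi, 3))
```

In this toy the warp cannot change the result, because the rule reads only the forecast's ordering. The referee's point is that the paper's model, with minimum bid sizes and efficiency losses, may not share that property, so the same test run on the real optimization would be informative.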

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed report, which highlights important areas for clarification and strengthening. We address each major comment below and will incorporate revisions to improve transparency and robustness. The core empirical findings on the dominance of Kendall tau over MAE remain supported by the multi-year dataset and hierarchical optimization, but we agree that additional details will enhance the manuscript.

point-by-point responses
  1. Referee: The hierarchical three-layer optimization is described at a high level in the methods without explicit objective functions or constraint formulations showing how absolute price levels versus rank order enter capacity allocation, aFRR/FCR activation, and XBID bidding. This is load-bearing for the claim that revenue depends primarily on tau rather than MAE, as any absolute-level dependence (e.g., via minimum bid sizes or efficiency losses) would undermine the reported dominance of rank correlation.

    Authors: We acknowledge that the optimization model was summarized at a high level in the original submission. In the revised manuscript, we will add explicit mathematical formulations of the objective functions and key constraints for each of the three layers. These will demonstrate how the model primarily relies on price rank order for intraday dispatch decisions (consistent with the tau-sufficiency result) while incorporating absolute price levels for constraints such as minimum bid sizes, efficiency losses, and capacity allocation. We maintain that the empirical evidence from the 2020-2025 data supports the primary role of rank correlation, but agree that full transparency on the model equations is essential. revision: yes

  2. Referee: The empirical tau threshold (0.85-0.95) and associated 97-100% revenue recovery figures are presented without bootstrap intervals, cross-validation across volatility regimes, or explicit sensitivity analysis perturbing forecast magnitudes while holding rank order fixed. This weakens the stability claim and the assertion that the result is not an artifact of the specific model parameterization.

    Authors: The referee is correct that bootstrap intervals, cross-validation, and targeted sensitivity analyses were not included in the original version. We will revise the results section to add bootstrap confidence intervals around the revenue recovery percentages, perform cross-validation across identified volatility regimes (e.g., pre- and post-energy crisis periods), and include an explicit sensitivity analysis that perturbs absolute forecast magnitudes while preserving rank order. These additions will directly address concerns about stability and parameterization artifacts while preserving the reported threshold findings. revision: yes

  3. Referee: The paper reports real 2020-2025 data from Regelleistung.net and Swissgrid but provides no details on data exclusion rules, handling of missing periods, or statistical controls for regime shifts. Without these, it is difficult to assess whether the tau-MAE comparison and hydro anomaly result (p=0.0005) are robust to post-hoc selection.

    Authors: We agree that a dedicated description of data processing is necessary for assessing robustness. In the revised manuscript, we will add a subsection on data handling that specifies exclusion rules (e.g., for anomalous or incomplete market sessions), methods for handling missing periods (e.g., interpolation or exclusion criteria), and statistical controls for regime shifts, including tests around the 2022 energy crisis. For the Swiss hydro anomaly regression, we will clarify the controls and robustness checks used to obtain p=0.0005. These details will allow readers to evaluate the tau-MAE comparison and related findings more rigorously. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results from data-driven optimization on real market data.

full rationale

The paper reports an empirical study: a hierarchical three-layer optimization is run on actual 2020-2025 German/Swiss market data, perfect-foresight revenue is computed as benchmark, and correlations are measured between forecast quality metrics (MAE, Kendall tau) and realized dispatch value. The claimed tau threshold (approximately 0.85-0.95) and the 97-100% revenue capture are presented as observed outcomes of these simulations rather than quantities defined or forced by the model's own equations. No load-bearing self-citations, self-definitional steps, fitted inputs renamed as predictions, or ansatz smuggling appear in the abstract or described methodology. The derivation chain is therefore self-contained and externally falsifiable against the real data and perfect-foresight benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the validity of the optimization model linking forecasts to revenue and on the representativeness of the 2020-2025 data period for deriving the tau threshold and market associations.

free parameters (1)
  • Kendall tau threshold = 0.85-0.95
    Empirically identified from optimization runs as the level capturing 97-100% of perfect-foresight revenue
axioms (2)
  • domain assumption: The hierarchical three-layer optimization accurately represents optimal real-world dispatch decisions across the studied markets
    Invoked to connect forecast properties to achieved revenue
  • domain assumption: The 2020-2025 market data period is representative and free of unmodeled structural breaks
    Used to establish the stable threshold and Swiss hydrological association


