pith. machine review for the scientific record.

arxiv: 2604.14739 · v1 · submitted 2026-04-16 · 💻 cs.LG

Recognition: unknown

Assessing the Performance-Efficiency Trade-off of Foundation Models in Probabilistic Electricity Price Forecasting


Pith reviewed 2026-05-10 12:13 UTC · model grok-4.3

classification 💻 cs.LG
keywords electricity price forecasting · probabilistic forecasting · time series foundation models · NHITS · CRPS · performance trade-off · day-ahead markets · quantile regression

The pith

While time series foundation models generally produce more accurate probabilistic electricity price forecasts than task-specific models, the latter can perform equally well or better when properly configured.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares Time Series Foundation Models against task-specific deep learning approaches for generating probabilistic day-ahead electricity price forecasts in European markets. It evaluates Moirai and ChronosX against an NHITS model paired with quantile regression averaging and a conditional normalizing flow forecaster. TSFMs deliver stronger results on standard probabilistic metrics such as CRPS and Energy Score along with better interval calibration across varying market conditions. At the same time, the NHITS plus QRA combination reaches nearly identical performance levels and surpasses the foundation models in certain setups that include extra market features or few-shot adaptation from other zones. The central takeaway is that any accuracy edge from foundation models must be weighed against their greater computational demands.

Core claim

Across multiple European bidding zones the foundation models Moirai and ChronosX achieve lower CRPS, lower Energy Scores, and better predictive interval calibration than deep learning models trained from scratch for probabilistic electricity price forecasting. A carefully configured NHITS backbone combined with quantile regression averaging reaches performance levels very close to the foundation models and exceeds them when supplied with additional informative feature groups or when adapted through few-shot learning drawn from other markets. The work therefore concludes that the expressive capacity of foundation models is real yet conventional models stay highly competitive once efficiency,

What carries the argument

Head-to-head comparison of probabilistic forecasting models (Moirai, ChronosX, NHITS+QRA, and normalizing flows) on European day-ahead price data using CRPS, Energy Score, and calibration metrics to quantify the accuracy-efficiency trade-off.
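
For readers unfamiliar with the two scores named above, the following is a minimal sketch of their standard sample-based estimators, not the paper's evaluation code. The function names and the toy numbers are illustrative assumptions; in practice the samples would be draws from a model's predictive distribution, and the Energy Score is assumed here to take one vector of prices per delivery day.

```python
import math
from typing import Sequence


def crps_from_samples(samples: Sequence[float], y: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term_obs = sum(abs(x - y) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def energy_score(samples: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    """Multivariate analogue using Euclidean norms instead of absolute values."""
    n = len(samples)
    term_obs = sum(math.dist(x, y) for x in samples) / n
    term_spread = sum(math.dist(a, b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


# Toy example (made-up numbers): two equally likely price scenarios in EUR/MWh.
print(crps_from_samples([40.0, 60.0], 50.0))  # 5.0
print(energy_score([[40.0, 45.0], [60.0, 55.0]], [50.0, 50.0]))
```

Both scores are negatively oriented, so lower is better, and for one-dimensional vectors the Energy Score reduces to the CRPS. The double loop is quadratic in the number of samples, which is acceptable for a sketch but would need vectorizing for large ensembles.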

If this is right

  • TSFMs become the default choice only when the small accuracy gain justifies their higher compute cost.
  • NHITS combined with QRA offers a practical alternative that matches or exceeds foundation models under targeted enhancements.
  • Few-shot adaptation and extra informative features can close or reverse the performance gap for task-specific models.
  • Model selection for probabilistic electricity price forecasting requires explicit accounting of computational expense against marginal accuracy gains.
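
QRA, as named in the list above, fits one linear quantile regression per quantile level on top of point forecasts, and each regression minimizes the pinball (quantile) loss. The sketch below shows only that loss, under the assumption of a single scalar forecast per observation; the function names and the numbers are illustrative and do not come from the paper.

```python
from typing import Sequence


def pinball_loss(y: float, q_pred: float, tau: float) -> float:
    """Pinball loss of a predicted tau-quantile q_pred for observation y."""
    diff = y - q_pred
    # Under-prediction (y above the quantile) is weighted by tau,
    # over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball_loss(ys: Sequence[float], q_preds: Sequence[float], tau: float) -> float:
    """Average pinball loss over paired observations and quantile forecasts."""
    return sum(pinball_loss(y, q, tau) for y, q in zip(ys, q_preds)) / len(ys)


# Made-up prices (EUR/MWh): a 0.9-quantile forecast that is too low is
# penalised nine times as heavily as one that is too high by the same amount.
print(pinball_loss(100.0, 90.0, 0.9))
print(pinball_loss(90.0, 100.0, 0.9))
```

Averaging this loss over a dense grid of quantile levels gives a common discrete approximation to the CRPS, which is one reason quantile-based models can be compared with sample-based ones on the same metric.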

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners facing similar volatility in energy markets should benchmark both foundation and task-specific models on their own data before committing resources.
  • The results point toward possible hybrid pipelines that start with task-specific models and selectively incorporate foundation-model components only where needed.
  • Future work could test whether the same narrow gap appears in longer-horizon or intraday probabilistic forecasting tasks.

Load-bearing premise

The performance gaps and efficiency trade-offs measured in the selected European bidding zones and model configurations will generalize to other electricity markets, different time periods, and real deployment settings.

What would settle it

A replication study on an unseen set of electricity markets or a later time window in which task-specific models remain substantially worse than TSFMs even after feature additions and few-shot adaptation would falsify the claim that conventional models remain highly competitive.

Figures

Figures reproduced from arXiv: 2604.14739 by Benjamin Schäfer, Hadeer El Ashhab, Jan Niklas Lettner, Veit Hagenmeyer.

Figure 1: Overview of this work. The left column lists the evaluated models: a simple baseline (M0), a conditioned normalizing ...
Figure 2: Average electricity spot market prices across Euro ...
Figure 3: NHITS + QRA, cross-border, few-shot on certain ...
Original abstract

Large-scale renewable energy deployment introduces pronounced volatility into the electricity system, turning grid operation into a complex stochastic optimization problem. Accurate electricity price forecasting (EPF) is essential not only to support operational decisions, such as optimal bidding strategies and balancing power preparation, but also to reduce economic risk and improve market efficiency. Probabilistic forecasts are particularly valuable because they quantify uncertainty stemming from renewable intermittency, market coupling, and regulatory changes, enabling market participants to make informed decisions that minimize losses and optimize expected revenues. However, it remains an open question which models to employ to produce accurate forecasts. Should these be task-specific machine learning (ML) models or Time Series Foundation Models (TSFMs)? In this work, we compare four models for day-ahead probabilistic EPF (PEPF) in European bidding zones: a deterministic NHITS backbone with Quantile-Regression Averaging (NHITS+QRA) and a conditional Normalizing-Flow forecaster (NF) are compared with two TSFMs, namely Moirai and ChronosX. On the one hand, we find that TSFMs outperform task-specific deep learning models trained from scratch in terms of CRPS, Energy Score, and predictive interval calibration across market conditions. On the other hand, we find that well-configured task-specific models, particularly NHITS combined with QRA, achieve performance very close to TSFMs, and in some scenarios, such as when supplied with additional informative feature groups or adapted via few-shot learning from other European markets, they can even surpass TSFMs. Overall, our findings show that while TSFMs offer expressive modeling capabilities, conventional models remain highly competitive, emphasizing the need to weigh computational expense against marginal performance improvements in PEPF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically benchmarks two task-specific models (NHITS combined with Quantile Regression Averaging and a conditional Normalizing Flow forecaster) against two Time Series Foundation Models (Moirai and ChronosX) for day-ahead probabilistic electricity price forecasting across European bidding zones. It claims that TSFMs generally deliver superior CRPS, Energy Score, and predictive interval calibration, yet well-tuned task-specific models achieve near-parity and can exceed TSFMs when augmented with additional features or adapted via few-shot learning from other markets, highlighting the need to balance performance gains against computational costs.

Significance. If the reported orderings hold under rigorous controls, the work supplies actionable guidance for energy-market forecasters on when the expressive power of foundation models justifies their overhead versus the competitiveness of specialized architectures. The balanced, non-universal conclusion is a strength, as is the focus on probabilistic metrics relevant to stochastic optimization in grids with high renewable penetration.

major comments (2)
  1. [Abstract and §1] Abstract and §1 (Introduction): The title foregrounds a 'Performance-Efficiency Trade-off', yet neither the abstract nor the stated findings quantify efficiency (training/inference time, memory footprint, or FLOPs). The discussion of 'computational expense' therefore remains qualitative; if §5 or §6 contains such measurements, they should be elevated to the results and tied directly to the performance deltas.
  2. [§3 or §4] §3 (Experimental Setup) or §4 (Results): The abstract asserts outperformance 'across market conditions' and 'in some scenarios' without reference to the exact bidding zones, data provenance (e.g., ENTSO-E identifiers), train/validation/test temporal splits, hyperparameter search protocol, or statistical significance tests on the CRPS/Energy Score differences. These omissions make it impossible to assess whether the reported near-parity or occasional superiority of NHITS+QRA is robust or sensitive to selection effects.
minor comments (2)
  1. [Throughout] Ensure that all acronyms (PEPF, CRPS, QRA, TSFM) are defined at first use and that figure captions explicitly state the number of runs or seeds underlying any averaged metrics.
  2. [Results tables] If tables compare multiple models and markets, add a column or footnote indicating whether differences are statistically significant (e.g., via Diebold-Mariano or paired t-tests) to strengthen the central comparative claim.
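
As a rough illustration of the significance check suggested in the comment above, the following sketch computes a Diebold-Mariano statistic from two per-day loss series (for example daily CRPS values). It is a simplified version that assumes one-step-ahead forecasts, so the long-run variance reduces to the lag-0 variance of the loss differential, and it uses a normal approximation for the p-value. The function and variable names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def diebold_mariano(loss_a: Sequence[float], loss_b: Sequence[float]) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) for equal predictive accuracy.

    Simplification: horizon h = 1, so no autocovariance terms are included.
    A positive statistic means model A has the larger average loss.
    """
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    d_bar = sum(d) / n
    gamma0 = sum((x - d_bar) ** 2 for x in d) / n
    if gamma0 == 0.0:
        raise ValueError("loss differential has zero variance")
    stat = d_bar / math.sqrt(gamma0 / n)
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Made-up daily losses for two models over four days.
stat, p = diebold_mariano([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.5, 3.0])
print(stat, p)
```

For serially correlated loss differentials, a heteroskedasticity and autocorrelation consistent variance estimate would replace `gamma0`, and small samples are usually handled with a finite-sample correction.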

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the detailed, constructive comments. These suggestions will help strengthen the clarity, reproducibility, and balance of the manuscript. We address each major comment below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The title foregrounds a 'Performance-Efficiency Trade-off', yet neither the abstract nor the stated findings quantify efficiency (training/inference time, memory footprint, or FLOPs). The discussion of 'computational expense' therefore remains qualitative; if §5 or §6 contains such measurements, they should be elevated to the results and tied directly to the performance deltas.

    Authors: We agree that the current treatment of efficiency is largely qualitative and that this weakens the title's emphasis on the trade-off. Although later sections note the higher computational demands of TSFMs, no explicit measurements appear in the results. In the revision we will add quantitative efficiency metrics (training time, inference latency per forecast, peak GPU memory, and estimated FLOPs) for all four models, present them in a new table or figure in §4, and directly link the deltas to the CRPS/Energy Score improvements. The abstract will be updated to reference these findings. revision: yes

  2. Referee: [§3 or §4] §3 (Experimental Setup) or §4 (Results): The abstract asserts outperformance 'across market conditions' and 'in some scenarios' without reference to the exact bidding zones, data provenance (e.g., ENTSO-E identifiers), train/validation/test temporal splits, hyperparameter search protocol, or statistical significance tests on the CRPS/Energy Score differences. These omissions make it impossible to assess whether the reported near-parity or occasional superiority of NHITS+QRA is robust or sensitive to selection effects.

    Authors: We appreciate the call for greater transparency. The experimental setup in §3 already specifies the European bidding zones, ENTSO-E data sources, temporal splits, and hyperparameter protocol, but these details are not summarized in the abstract or §1. We will (i) add a concise list of the exact bidding zones and ENTSO-E identifiers to the abstract and introduction, (ii) explicitly state the train/validation/test periods and hyperparameter search method in §1, and (iii) include statistical significance tests (Diebold-Mariano and Wilcoxon signed-rank) on the CRPS and Energy Score differences in the revised §4 to demonstrate robustness. These changes will allow readers to evaluate the near-parity claims directly. revision: yes
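
The efficiency metrics promised in the first response could be collected along the following lines. This is a minimal sketch, not the authors' protocol: `measure_latency` and the stand-in `dummy_forecast` are hypothetical, the timer measures wall-clock time only, and `tracemalloc` reports peak Python heap allocation rather than GPU memory, which would need a framework-specific counter.

```python
import statistics
import time
import tracemalloc
from typing import Callable, Dict


def measure_latency(forecast_fn: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Time repeated calls and record peak Python heap usage."""
    timings = []
    tracemalloc.start()
    for _ in range(repeats):
        start = time.perf_counter()
        forecast_fn()
        timings.append(time.perf_counter() - start)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "median_seconds": statistics.median(timings),
        "peak_mib": peak_bytes / (1024 * 1024),
    }


def dummy_forecast() -> list:
    """Stand-in for a model's predict call (hypothetical workload)."""
    return [sum(range(1000)) for _ in range(24)]


report = measure_latency(dummy_forecast)
print(report)
```

Tracing allocations slows execution, so a careful benchmark would time and trace in separate passes. Reporting the median over several repeats reduces the influence of warm-up and scheduling noise.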

Circularity Check

0 steps flagged

Pure empirical benchmarking study with no circular derivations

Full rationale

This is a direct empirical comparison of four models (NHITS+QRA, NF, Moirai, ChronosX) on day-ahead probabilistic electricity price forecasting across European bidding zones, using standard metrics (CRPS, Energy Score, calibration) computed from held-out test data. No mathematical derivations, uniqueness theorems, ansatzes, or predictions are claimed; all results are obtained by training the models on the described datasets and evaluating them. Any citations are for model architectures or prior benchmarks and are not load-bearing for the ordering of performance results. The central claims reduce only to the experimental outcomes, with no reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical model comparison; the abstract introduces no new mathematical axioms, free parameters, or invented entities.


