pith. machine review for the scientific record.

arxiv: 2512.20761 · v3 · submitted 2025-12-23 · 💻 cs.LG · cs.AI

TS-Arena -- A Live Forecast Pre-Registration Platform

Pith reviewed 2026-05-16 19:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecasting · pre-registration · benchmarking · information leakage · live evaluation · foundation models · continuous assessment

The pith

TS-Arena requires forecasting models to submit predictions before future data exists, eliminating test-set leakage by design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating time series models on historical data risks contamination through overlapping samples or correlated series. TS-Arena moves assessment to live future data streams, where models must pre-register predictions before the ground truth physically arrives. This pre-registration protocol makes leakage impossible. The platform uses a modular microservice architecture to harmonize data from multiple sources and run containerized model submissions on ongoing streams. One year of operation on energy time series shows established models build consistent longitudinal scores while new models can demonstrate immediate competitiveness.

Core claim

TS-Arena is a live forecasting platform that enforces a strict pre-registration protocol: models must submit predictions before the corresponding ground-truth data exists. It relies on a modular microservice architecture to structure data from diverse sources and orchestrate containerized submissions. Over one year of energy time series, established models accumulate robust scores while the continuous format lets newcomers compete right away.

What carries the argument

The strict forecasting pre-registration protocol on live data streams, which forces submissions prior to data availability and uses microservices to manage containerized runs.
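
A minimal sketch of the acceptance rule this protocol implies, under stated assumptions: the `Round` structure, its field names, and `accept_submission` are hypothetical and are not taken from the TS-Arena codebase. The only rule encoded is the one described above: a forecast counts only if it arrives strictly before the registration deadline, and that deadline must not fall after the start of the period being forecast.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Round:
    """One forecasting round (hypothetical structure, not the platform's schema)."""
    round_id: str
    registration_deadline: datetime  # forecasts must arrive strictly before this
    horizon_start: datetime          # first timestamp to be forecast
    horizon_end: datetime            # last timestamp to be forecast


def accept_submission(rnd: Round, received_at: datetime) -> bool:
    """Return True only for forecasts received strictly before the deadline."""
    if rnd.registration_deadline > rnd.horizon_start:
        # A deadline inside the horizon would let ground truth leak in.
        raise ValueError("deadline must not fall after the start of the horizon")
    return received_at < rnd.registration_deadline


# Usage with made-up timestamps.
deadline = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
demo_round = Round("demo-round", deadline, deadline, deadline + timedelta(hours=24))
print(accept_submission(demo_round, deadline - timedelta(minutes=5)))  # True
print(accept_submission(demo_round, deadline + timedelta(seconds=1)))  # False
```

The comparison uses the server-side receipt time rather than a client-supplied timestamp, since a client clock could be set arbitrarily; whether the platform does the same is not stated on this page.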

If this is right

  • Test-set contamination becomes impossible because predictions precede data arrival.
  • Evaluation shifts from infrequent static competitions to continuous longitudinal tracking.
  • Models can be assessed for true generalization on data that did not exist during development.
  • New entrants can demonstrate performance without waiting for the next large-scale competition cycle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-registration idea could apply to live evaluation in other sequential prediction tasks such as reinforcement learning.
  • Sustaining the platform will depend on securing long-term public or private data feeds that cannot be exhausted.
  • Continuous live benchmarks may reduce reliance on fixed historical test sets across machine learning domains.
  • Operational reliability of container orchestration becomes a new evaluation criterion alongside predictive accuracy.

Load-bearing premise

Containerized model submissions run reliably on live streams without introducing new leakage or failures, and enough future data sources will stay available over time.

What would settle it

A documented case where a submitted model receives or uses data after its prediction deadline, or repeated execution failures when running containers on new live streams.

Figures

Figures reproduced from arXiv: 2512.20761 by Henrik Albers, Kevin Zalipski, Marcel Meyer, Oliver Müller, Sascha Kaltenpoth.

Figure 1: The Forecast Pre-Registration Protocol

Figure 2: The Microservice Architecture
Excerpt from section 3.3.1, Orchestration by Challenges: As mentioned, we adapt the concept of competitions [6, 27] and convert it into more fast-paced challenges. A challenge bundles multiple time series and contains many rounds, whose forecasts need to be pre-registered before the same time point t_now and evaluated in an aggregated manner after the actual target values exist. The ti…

Figure 3: TS-Arena Challenge and Model View
Excerpt from section 3.5, Reference Model Service: The Reference Model Service ensures an active participation of a number of reference models in every challenge and round that serve as common baselines. The Reference Model Service interacts with the public TS-Arena API Portal just like any other participant, ensuring the reference models operate under the exact same constraints (e.g., submissio…

Figure 4: Example of a hard coal challenge with high average MASE

Figure 5: Example of a natural gas challenge with high average MASE

Figure 6: Example of a complete hard coal time series with many high average MASE rounds

Figure 7: Example of a complete natural gas time series with many high average MASE rounds
Excerpt from section C, Results Snapshot 2025-12-31: This section provides a detailed breakdown of the benchmark results as of December 31, 2025. While the primary objective of TS-Arena is to facilitate a continuous and evolving evaluation of TSFMs, we present this snapshot to offer…
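
Figures 4 to 7 report average MASE. For orientation, the sketch below computes the mean absolute scaled error in the sense of Hyndman and Koehler (reference [16]), using the common seasonal-naive generalization: the forecast's mean absolute error divided by the in-sample mean absolute error of a naive forecast that repeats the value from `season` steps earlier. The function name, the `season` parameter, and the toy numbers are illustrative assumptions; this page does not state which seasonal period or scaling window TS-Arena uses.

```python
from typing import Sequence


def mase(y_true: Sequence[float], y_pred: Sequence[float],
         y_insample: Sequence[float], season: int = 1) -> float:
    """Mean absolute scaled error of y_pred against y_true.

    The scale is the in-sample MAE of the (seasonal) naive forecast.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if len(y_insample) <= season:
        raise ValueError("need more in-sample points than the seasonal period")

    # Mean absolute error of the forecast on the evaluation window.
    mae_forecast = sum(abs(a - f) for a, f in zip(y_true, y_pred)) / len(y_true)

    # Mean absolute error of the naive forecast on the history.
    naive_errors = [abs(y_insample[t] - y_insample[t - season])
                    for t in range(season, len(y_insample))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("in-sample series is constant; MASE is undefined")
    return mae_forecast / scale


# Toy example: naive in-sample MAE is 2.0, forecast MAE is 1.5.
history = [10.0, 12.0, 10.0, 12.0, 10.0]
print(mase([14.0, 15.0], [13.0, 17.0], history))  # 0.75
```

A value below 1 means the forecast beat the naive baseline on average; the "high average MASE" rounds in the captions are those where forecasts did markedly worse than that baseline.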

Original abstract

Time Series Foundation Models (TSFMs) are transforming the field of forecasting. However, evaluating them on historical data is increasingly difficult due to the risks of train-test sample overlaps and temporal overlaps between correlated train and test time series. To address this, we introduce TS-Arena, a live forecasting platform that shifts evaluation from the known past to the unknown future. Building on the concept of continuous benchmarking, TS-Arena evaluates models on future data. Crucially, we introduce a strict forecasting pre-registration protocol: models must submit predictions before the ground-truth data physically exists. This makes test-set contamination impossible by design. The platform relies on a modular microservice architecture that harmonizes and structures data from different sources and orchestrates containerized model submissions. By enforcing a strict pre-registration protocol on live data streams, TS-Arena prevents information leakage offers a faster alternative to traditional static, infrequently repeated competitions (e.g. the M-Competitions). First empirical results derived from operating TS-Arena over one year of energy time series demonstrate that established TSFMs accumulate robust longitudinal scores over time, while the continuous nature of the benchmark simultaneously allows newcomers to demonstrate immediate competitiveness. TS-Arena provides the necessary infrastructure to assess the true generalization capabilities of modern forecasting models. The platform and corresponding code are available at https://ts-arena.live/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces TS-Arena, a live forecasting platform for evaluating Time Series Foundation Models on future data streams rather than historical data. It centers on a strict pre-registration protocol requiring model submissions before ground-truth data exists, implemented through a modular microservice architecture for data harmonization and containerized model orchestration. One-year empirical results on energy time series are described at a high level to illustrate longitudinal scoring and newcomer competitiveness, with the platform and code made publicly available.

Significance. If the isolation guarantees can be substantiated, the platform would address a genuine and growing problem in TSFM evaluation by enabling continuous, leakage-resistant benchmarking on live streams, providing a faster and more realistic alternative to static competitions such as the M-Competitions. The open release of the platform and code is a clear strength that supports reproducibility and community adoption.

major comments (1)
  1. [Platform Architecture and Orchestration] The central claim that the pre-registration protocol renders test-set contamination impossible by design (abstract and §1) rests on the assumption that the microservice architecture and container orchestration can enforce submissions strictly before any ground-truth data exists. The manuscript provides no details on deadline enforcement, sandboxing to block access to correlated live streams or historical proxies, or audit logs for timing verification, leaving the load-bearing isolation guarantee underspecified.
minor comments (1)
  1. [Abstract] Abstract: the sentence 'TS-Arena prevents information leakage offers a faster alternative' is missing the conjunction 'and'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the isolation guarantees of the pre-registration protocol. We address the major comment point-by-point below and commit to expanding the relevant sections in the revised manuscript.

  1. Referee: [Platform Architecture and Orchestration] The central claim that the pre-registration protocol renders test-set contamination impossible by design (abstract and §1) rests on the assumption that the microservice architecture and container orchestration can enforce submissions strictly before any ground-truth data exists. The manuscript provides no details on deadline enforcement, sandboxing to block access to correlated live streams or historical proxies, or audit logs for timing verification, leaving the load-bearing isolation guarantee underspecified.

    Authors: We agree that the manuscript currently describes the enforcement mechanisms at a high level and would benefit from explicit details to substantiate the isolation claims. In the revised version we will add a dedicated subsection (new §3.3) that specifies: (1) deadline enforcement via a time-locked submission API that rejects any upload after the pre-defined cutoff (enforced by the orchestration service using synchronized clocks); (2) sandboxing through containerized execution with no outbound network access, read-only volumes, and explicit blocking of any external data sources or historical proxies; and (3) immutable audit logs that record submission timestamps, model container hashes, and verification events, which are publicly queryable. These mechanisms are already implemented in the released codebase; we will include a timeline diagram and pseudocode for the submission protocol to make the guarantees concrete. revision: yes
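
To make item (3) of the simulated response concrete, here is one way an append-only audit log could be made tamper-evident. This is a sketch under assumptions, not the platform's mechanism: the response does not say how immutability is achieved, and hash chaining is a technique substituted here for illustration. The field names (`submitted_at`, `container_hash`, `event`) mirror the items the response lists, while `append_entry` and `verify_chain` are hypothetical helpers.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry
FIELDS = ("submitted_at", "container_hash", "event")


def _digest(record: dict, prev_hash: str) -> str:
    """SHA-256 over a canonical JSON encoding of the record plus the previous hash."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, submitted_at: str, container_hash: str, event: str) -> dict:
    """Append a record whose hash commits to every earlier entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS
    record = {"submitted_at": submitted_at,
              "container_hash": container_hash,
              "event": event}
    entry = {**record, "prev_hash": prev_hash,
             "entry_hash": _digest(record, prev_hash)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        record = {key: entry[key] for key in FIELDS}
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _digest(record, prev_hash):
            return False
        prev_hash = entry["entry_hash"]
    return True


# Usage with made-up values.
log = []
append_entry(log, "2025-01-01T11:55:00Z", "sha256:aaaa", "submission_accepted")
append_entry(log, "2025-01-02T12:05:00Z", "sha256:aaaa", "round_evaluated")
print(verify_chain(log))                         # True
log[0]["submitted_at"] = "2025-01-01T12:30:00Z"  # backdating attempt
print(verify_chain(log))                         # False
```

A chain like this only detects tampering by someone who cannot rewrite all later hashes, so a public deployment would also need the latest hash to be published or anchored somewhere outside the operator's control.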

Circularity Check

0 steps flagged

No circularity in systems description and protocol

The paper is a systems and protocol description for a live forecasting platform. It contains no mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. The central claim about preventing leakage via pre-registration is presented as a design feature of the microservice architecture and container orchestration, without self-definitional loops, self-citation load-bearing arguments, or renaming of known results. Empirical results from one year of energy data are reported as observations from platform operation, not as outputs of any circular fitting process. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering and systems paper. No free parameters, mathematical axioms, or invented scientific entities are introduced; the contribution rests on the platform architecture and pre-registration protocol.

pith-pipeline@v0.9.0 · 5547 in / 997 out tokens · 47299 ms · 2026-05-16T19:56:39.805732+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting

    econ.EM 2026-04 unverdicted novelty 7.0

    Energy-Arena is a dynamic, forward-looking benchmarking platform that standardizes ex-ante submissions and rolling ex-post evaluations for operational energy forecasting to improve transparency and comparability.

  2. FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1]

    Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. 2024. GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation. doi:10.48550/arXiv.2410.10393. arXiv:2410.10393 [cs]

  2. [2]

    Chronos-2: From Univariate to Universal Forecasting

    Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, and Michael B...

  3. [3]

    Abdul Fatir Ansari, Caner Turkmen, Oleksandr Shchur, and Lorenzo Stella. 2024. Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon. AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/fast-and-accurate-zero-shot-forecasting-with-chronos-bolt-and-autogluon/

  4. [4]

    Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. 2025. TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning. arXiv preprint arXiv:2505.23719 (2025)

  5. [5]

    Joachim Bertsch, Christian Growitsch, Stefan Lorenczik, and Stephan Nagl. 2016. Flexibility in Europe’s power sector—An additional requirement or an automatic complement?Energy Economics53 (2016), 118–131

  6. [6]

    Casper Solheim Bojer and Jens Peder Meldgaard. 2021. Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting 37, 2 (April 2021), 587–603. doi:10.1016/j.ijforecast.2020.07.007

  7. [7]

    Bundesnetzagentur. [n. d.]. SMARD: Electricity Market Data Platform. https://www.smard.de

  8. [8]

    Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2024. A decoder-only foundation model for time-series forecasting

  9. [9]

    Vijay Ekambaram, Arindam Jati, Pankaj Dayama, Sumanta Mukherjee, Nam Nguyen, Wesley M Gifford, Chandra Reddy, and Jayant Kalagnanam. 2024. Tiny time mixers (ttms): Fast pre-trained models for enhanced zero/few-shot forecasting of multivariate time series. Advances in Neural Information Processing Systems 37 (2024), 74147–74181

  10. [10]

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. 2025. TabArena: A Living Benchmark for Machine Learning on Tabular Data. doi:10.48550/ARXIV.2506.16791. Version Number: 4

  11. [11]

    Denizalp Goktas, Amy Greenwald, Gerardo Riano-Briceno, Alexandra Magnusson, Alif Abdullah, and Beatriz de Lucio. 2025. TempusBench: An evaluation framework for time-series forecasting. In Recent advances in time series foundation models have we reached the ’BERT moment’? https://openreview.net/forum?id=3fMa060Ag5

  12. [12]

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885 (2024)

  13. [13]

    Lars Graf, Thomas Ortner, Stanisław Woźniak, and Angeliki Pantazi. 2025. Flowstate: Sampling rate invariant time series forecasting. arXiv preprint arXiv:2508.05287 (2025)

  14. [14]

    Tao Hong, Pierre Pinson, Yi Wang, Rafal Weron, Dazhi Yang, and Hamidreza Zareipour. 2020. Energy Forecasting: A Review and Outlook. IEEE Open Access Journal of Power and Energy 7 (2020), 376–388. doi:10.1109/OAJPE.2020.3029979

  15. [15]

    Forecasting: Principles and Practice

    Rob Hyndman and G. Athanasopoulos. 2021. Forecasting: Principles and Practice (3rd ed.). OTexts, Australia

  16. [16]

    Another look at measures of forecast accuracy

    Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (Oct. 2006), 679–688. doi:10.1016/j.ijforecast.2006.03.001

  17. [17]

    Max Kanter. 2025. gridstatus: Extract data from ISOs and other energy grid sources. https://github.com/gridstatus/gridstatus

  18. [18]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. doi:10.48550/ARXIV.2001.08361. Version Number: 1

  19. [19]

    The data warehouse toolkit: the definitive guide to dimensional modeling

    Ralph Kimball. 2013. The data warehouse toolkit: the definitive guide to dimensional modeling (3rd ed.). J. Wiley & Sons, place of publication not identified

  20. [20]

    Steven Klee and Yuntian Xia. 2025. Measuring time series forecast stability for demand planning. In KDD 2025 workshop on AI for supply chain: Today and future. https://openreview.net/forum?id=26zedugY8W

  21. [21]

    Jesus Lago, Grzegorz Marcjasz, Bart De Schutter, and Rafał Weron. 2021. Forecasting day-ahead electricity prices: A review of state-of-the-art algorithms, best practices and an open-access benchmark. Applied Energy 293 (July 2021), 116983. doi:10.1016/j.apenergy.2021.116983

  22. [22]

    TSFM-Bench: A Comprehensive and Unified Benchmark of Foundation Models for Time Series Forecasting

    Zhe Li, Xiangfei Qiu, Peng Chen, Yihang Wang, Hanyin Cheng, Yang Shu, Jilin Hu, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, and Bin Yang. 2025. TSFM-Bench: A Comprehensive and Unified Benchmark of Foundation Models for Time Series Forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). Associa...

  23. [23]

    Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. 2024. Foundation Models for Time Series Analysis: A Tutorial and Survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, Barcelona Spain, 6555–6565. doi:10.1145/3637528.3671451

  24. [24]

    Rethinking Model Evaluation as Narrowing the Socio-Technical Gap

    Q. Vera Liao and Ziang Xiao. 2023. Rethinking Model Evaluation as Narrowing the Socio-Technical Gap. doi:10.48550/ARXIV.2306.03100. Version Number: 4

  25. [25]

    Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, and Junnan Li. 2025. Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698 (2025).

  26. [26]

    Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. 2025. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816 (2025)

  27. [27]

    Spyros Makridakis, Evangelos Spiliotis, Ross Hollyman, Fotios Petropoulos, Norman Swanson, and Anil Gaba. 2024. The M6 forecasting competition: Bridging the gap between forecasting and investment decisions. International Journal of Forecasting (Nov. 2024), S0169207024001079. doi:10.1016/j.ijforecast.2024.11.002

  28. [28]

    Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, and Oliver Müller. 2025. Time Series Foundation Models: Benchmarking Challenges and Requirements. doi:10.48550/ARXIV.2510.13654. Version Number: 1

  29. [29]

    Marcel Meyer, David Zapata Gonzalez, Sascha Kaltenpoth, and Oliver Müller. 2025. Benchmarking Time Series Foundation Models for Short-Term Household Electricity Load Forecasting. IEEE Access 13 (2025), 218141–218153. doi:10.1109/ACCESS.2025.3648056

  30. [30]

    Fingrid Oyj. [n. d.]. Fingrid Open Data Platform and API. https://data.fingrid.fi/en

  31. [31]

    Joaquín Amat Rodrigo and Javier Escobar Ortiz. 2024. Data leakage in pre-trained forecasting models. https://cienciadedatos.net/documentos/py63-data-leakage-pre-trained-forecasting-models.html

  32. [32]

    Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, and Chenghao Liu. 2025. VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones. arXiv preprint arXiv:2508.04379 (2025)

  33. [33]

    Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. 2024. Time-moe: Billion-scale time series foundation models with mixture of experts. arXiv preprint arXiv:2409.16040 (2024)

  34. [34]

    Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2024. Unified training of universal time series forecasting transformers. (2024)

  35. [35]

    Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, and Qiang Xu. 2025. Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting. doi:10.48550/ARXIV.2509.24789. Version Number: 3

  36. [36]

    Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, and Shirui Pan. 2025. Towards neural scaling laws for time series foundation models. In The thirteenth international conference on learning representations. https://openreview.net/forum?id=uCqxDfLYrB

  37. [37]

    Multi-period Learning for Financial Time Series Forecasting

    Xu Zhang, Zhengang Huang, Yunzhi Wu, Xun Lu, Erpeng Qi, Yunkai Chen, Zhongya Xue, Qitong Wang, Peng Wang, and Wei Wang. 2025. Multi-period Learning for Financial Time Series Forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1. ACM, Toronto ON Canada, 2848–2859. doi:10.1145/3690624.3709422