TS-Arena -- A Live Forecast Pre-Registration Platform
Pith reviewed 2026-05-16 19:56 UTC · model grok-4.3
The pith
TS-Arena requires forecasting models to submit predictions before future data exists, eliminating test-set leakage by design.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TS-Arena is a live forecasting platform that enforces a strict pre-registration protocol: models must submit predictions before the corresponding ground-truth data exists. It relies on a modular microservice architecture to structure data from diverse sources and orchestrate containerized submissions. Over one year of energy time series, established models accumulate robust scores while the continuous format lets newcomers compete right away.
What carries the argument
A strict forecasting pre-registration protocol on live data streams: every submission must arrive before the target data becomes available, and a microservice layer manages the containerized model runs.
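The acceptance rule behind pre-registration can be sketched in a few lines. This is a minimal illustration, not the platform's code: `ForecastRound`, `cutoff`, `target_start`, and `accept_submission` are hypothetical names, and the sketch assumes timezone-aware timestamps taken from a trusted server clock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ForecastRound:
    """One forecasting round (hypothetical structure, not from the paper)."""
    cutoff: datetime        # last moment a submission is accepted
    target_start: datetime  # first timestamp the forecast must cover


def accept_submission(round_: ForecastRound, received_at: datetime) -> bool:
    """Accept a forecast only if it arrives no later than the cutoff.

    The cutoff itself must lie strictly before the forecast window, so an
    accepted forecast always predates the data it is scored against.
    """
    if round_.cutoff >= round_.target_start:
        raise ValueError("cutoff must precede the forecast window")
    return received_at <= round_.cutoff


# Usage: a round whose window opens at midnight UTC, with a one-hour margin.
window_start = datetime(2026, 1, 2, tzinfo=timezone.utc)
demo_round = ForecastRound(cutoff=window_start - timedelta(hours=1),
                           target_start=window_start)
print(accept_submission(demo_round, window_start - timedelta(hours=2)))
```

The gap between `cutoff` and `target_start` is a design choice in this sketch: it leaves room for clock skew between services.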
If this is right
- Test-set contamination becomes impossible because predictions precede data arrival.
- Evaluation shifts from infrequent static competitions to continuous longitudinal tracking.
- Models can be assessed for true generalization on data that did not exist during development.
- New entrants can demonstrate performance without waiting for the next large-scale competition cycle.
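Continuous longitudinal tracking, with newcomers joining mid-stream, might look roughly like the sketch below. It is an illustration under stated assumptions: the `Leaderboard` class is hypothetical, and mean absolute error stands in for whatever metric the platform actually uses, which this page does not name.

```python
from collections import defaultdict
from statistics import mean


def mean_absolute_error(forecast, actual):
    """Average absolute deviation between a forecast and the realized values."""
    if len(forecast) != len(actual):
        raise ValueError("forecast and actual must have the same length")
    return mean(abs(f - a) for f, a in zip(forecast, actual))


class Leaderboard:
    """Accumulates per-round errors for each model (hypothetical helper)."""

    def __init__(self):
        self._errors = defaultdict(list)

    def record_round(self, actual, forecasts):
        """Score every model that submitted a forecast for this round."""
        for model, forecast in forecasts.items():
            self._errors[model].append(mean_absolute_error(forecast, actual))

    def standings(self):
        """Return (model, mean error, rounds scored), lowest error first."""
        rows = [(m, mean(e), len(e)) for m, e in self._errors.items()]
        return sorted(rows, key=lambda row: row[1])


board = Leaderboard()
board.record_round([10.0, 12.0], {"established": [11.0, 12.0]})
# A newcomer joins in the second round and is ranked immediately.
board.record_round([8.0, 9.0], {"established": [8.0, 10.0],
                                "newcomer": [8.0, 9.0]})
print(board.standings())
```

Reporting the number of rounds scored next to the mean error lets a reader separate a long track record from a short but strong start.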
Where Pith is reading between the lines
- The same pre-registration idea could apply to live evaluation in other sequential prediction tasks such as reinforcement learning.
- Sustaining the platform will depend on securing long-term public or private data feeds that cannot be exhausted.
- Continuous live benchmarks may reduce reliance on fixed historical test sets across machine learning domains.
- Operational reliability of container orchestration becomes a new evaluation criterion alongside predictive accuracy.
Load-bearing premise
Containerized model submissions run reliably on live streams without introducing new leakage or failures, and enough future data sources will stay available over time.
What would settle it
A documented case where a submitted model receives or uses data after its prediction deadline, or repeated execution failures when running containers on new live streams.
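A check of this kind could be run by anyone with access to submission records. The sketch below assumes a hypothetical record layout (`id`, `submitted_at`, `data_available_at`); the platform's real log schema is not described on this page.

```python
from datetime import datetime, timezone


def find_violations(records):
    """Return ids of submissions timestamped at or after the moment the
    first ground-truth value for their forecast window became available."""
    return [r["id"] for r in records
            if r["submitted_at"] >= r["data_available_at"]]


# Illustrative records only; the field names are assumptions.
available = datetime(2026, 1, 1, 12, tzinfo=timezone.utc)
records = [
    {"id": "a", "submitted_at": datetime(2026, 1, 1, 10, tzinfo=timezone.utc),
     "data_available_at": available},
    {"id": "b", "submitted_at": datetime(2026, 1, 1, 13, tzinfo=timezone.utc),
     "data_available_at": available},
]
print(find_violations(records))
```

A single non-empty result from such a scan would be the documented case described above.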
Original abstract
Time Series Foundation Models (TSFMs) are transforming the field of forecasting. However, evaluating them on historical data is increasingly difficult due to the risks of train-test sample overlaps and temporal overlaps between correlated train and test time series. To address this, we introduce TS-Arena, a live forecasting platform that shifts evaluation from the known past to the unknown future. Building on the concept of continuous benchmarking, TS-Arena evaluates models on future data. Crucially, we introduce a strict forecasting pre-registration protocol: models must submit predictions before the ground-truth data physically exists. This makes test-set contamination impossible by design. The platform relies on a modular microservice architecture that harmonizes and structures data from different sources and orchestrates containerized model submissions. By enforcing a strict pre-registration protocol on live data streams, TS-Arena prevents information leakage offers a faster alternative to traditional static, infrequently repeated competitions (e.g. the M-Competitions). First empirical results derived from operating TS-Arena over one year of energy time series demonstrate that established TSFMs accumulate robust longitudinal scores over time, while the continuous nature of the benchmark simultaneously allows newcomers to demonstrate immediate competitiveness. TS-Arena provides the necessary infrastructure to assess the true generalization capabilities of modern forecasting models. The platform and corresponding code are available at https://ts-arena.live/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TS-Arena, a live forecasting platform for evaluating Time Series Foundation Models on future data streams rather than historical data. It centers on a strict pre-registration protocol requiring model submissions before ground-truth data exists, implemented through a modular microservice architecture for data harmonization and containerized model orchestration. One-year empirical results on energy time series are described at a high level to illustrate longitudinal scoring and newcomer competitiveness, with the platform and code made publicly available.
Significance. If the isolation guarantees can be substantiated, the platform would address a genuine and growing problem in TSFM evaluation by enabling continuous, leakage-resistant benchmarking on live streams, providing a faster and more realistic alternative to static competitions such as the M-Competitions. The open release of the platform and code is a clear strength that supports reproducibility and community adoption.
major comments (1)
- [Platform Architecture and Orchestration] The central claim that the pre-registration protocol renders test-set contamination impossible by design (abstract and §1) rests on the assumption that the microservice architecture and container orchestration can enforce submissions strictly before any ground-truth data exists. The manuscript provides no details on deadline enforcement, sandboxing to block access to correlated live streams or historical proxies, or audit logs for timing verification, leaving the load-bearing isolation guarantee underspecified.
minor comments (1)
- [Abstract] The sentence 'TS-Arena prevents information leakage offers a faster alternative' is missing the conjunction 'and' between 'leakage' and 'offers'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the isolation guarantees of the pre-registration protocol. We address the major comment point-by-point below and commit to expanding the relevant sections in the revised manuscript.
- Referee: [Platform Architecture and Orchestration] The central claim that the pre-registration protocol renders test-set contamination impossible by design (abstract and §1) rests on the assumption that the microservice architecture and container orchestration can enforce submissions strictly before any ground-truth data exists. The manuscript provides no details on deadline enforcement, sandboxing to block access to correlated live streams or historical proxies, or audit logs for timing verification, leaving the load-bearing isolation guarantee underspecified.
  Authors: We agree that the manuscript currently describes the enforcement mechanisms at a high level and would benefit from explicit details to substantiate the isolation claims. In the revised version we will add a dedicated subsection (new §3.3) that specifies: (1) deadline enforcement via a time-locked submission API that rejects any upload after the pre-defined cutoff (enforced by the orchestration service using synchronized clocks); (2) sandboxing through containerized execution with no outbound network access, read-only volumes, and explicit blocking of any external data sources or historical proxies; and (3) immutable audit logs that record submission timestamps, model container hashes, and verification events, which are publicly queryable. These mechanisms are already implemented in the released codebase; we will include a timeline diagram and pseudocode for the submission protocol to make the guarantees concrete.
  Revision: yes
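One common way to make an audit log tamper-evident is to chain entries by hash, so that each entry commits to the one before it. The sketch below illustrates that general technique only. It is an assumption, not the mechanism in the released codebase, and `AuditLog`, its field names, and the example hashes are hypothetical.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log in which each entry commits to its predecessor."""

    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self):
        self.entries = []

    def append(self, submitted_at: str, container_hash: str) -> dict:
        prev = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        body = {"submitted_at": submitted_at,
                "container_hash": container_hash,
                "prev_hash": prev}
        entry = dict(body, entry_hash=_digest(body))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if body["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True


log = AuditLog()
log.append("2026-01-01T10:00:00Z", "sha256:aaa")
log.append("2026-01-01T11:00:00Z", "sha256:bbb")
print(log.verify())
```

Publishing the latest `entry_hash` at regular intervals would let outside readers confirm that earlier timestamps were not rewritten after the fact.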
Circularity Check
No circularity in systems description and protocol
The paper is a systems and protocol description for a live forecasting platform. It contains no mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. The central claim about preventing leakage via pre-registration is presented as a design feature of the microservice architecture and container orchestration, without self-definitional loops, self-citation load-bearing arguments, or renaming of known results. Empirical results from one year of energy data are reported as observations from platform operation, not as outputs of any circular fitting process. This is a standard non-circular systems paper.
Forward citations
Cited by 2 Pith papers
- Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting
  Energy-Arena is a dynamic, forward-looking benchmarking platform that standardizes ex-ante submissions and rolling ex-post evaluations for operational energy forecasting to improve transparency and comparability.
- FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
  Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.