arxiv: 2512.20761 · v3 · submitted 2025-12-23 · 💻 cs.LG · cs.AI


TS-Arena -- A Live Forecast Pre-Registration Platform

Marcel Meyer, Sascha Kaltenpoth, Henrik Albers, Kevin Zalipski, Oliver Müller


Pith reviewed 2026-05-16 19:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords time series forecasting · pre-registration · benchmarking · information leakage · live evaluation · foundation models · continuous assessment


The pith

TS-Arena requires forecasting models to submit predictions before future data exists, eliminating test-set leakage by design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating time series models on historical data risks contamination through overlapping samples or correlated series. TS-Arena moves assessment to live future data streams, where models must pre-register predictions before the ground truth physically arrives. This pre-registration protocol makes leakage impossible. The platform uses a modular microservice architecture to harmonize data from multiple sources and run containerized model submissions on ongoing streams. One year of operation on energy time series shows established models build consistent longitudinal scores while new models can demonstrate immediate competitiveness.

Core claim

TS-Arena is a live forecasting platform that enforces a strict pre-registration protocol: models must submit predictions before the corresponding ground-truth data exists. It relies on a modular microservice architecture to structure data from diverse sources and orchestrate containerized submissions. Over one year of energy time series, established models accumulate robust scores while the continuous format lets newcomers compete right away.

What carries the argument

A strict forecast pre-registration protocol on live data streams: every submission must arrive before the data it predicts becomes available, and a microservice layer manages the containerized model runs.
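As a rough illustration of the pre-registration rule, the sketch below accepts a forecast only if it was registered strictly before every timestamp it predicts. The names `Forecast` and `is_pre_registered` are hypothetical and not taken from the TS-Arena codebase; this summary does not describe how the platform actually enforces the rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Forecast:
    """A submitted forecast (hypothetical structure, for illustration only)."""
    model_id: str
    registered_at: datetime   # when the platform received the submission
    target_times: tuple       # timestamps the forecast predicts
    values: tuple             # one predicted value per target timestamp


def is_pre_registered(forecast: Forecast) -> bool:
    """Return True only if every target lies strictly after registration."""
    if len(forecast.target_times) != len(forecast.values):
        raise ValueError("target_times and values must have equal length")
    return all(t > forecast.registered_at for t in forecast.target_times)


now = datetime(2025, 12, 1, 12, 0, tzinfo=timezone.utc)
horizon = tuple(now + timedelta(hours=h) for h in range(1, 4))

on_time = Forecast("model-a", now, horizon, (1.0, 1.1, 1.2))
late = Forecast("model-b", now + timedelta(hours=2), horizon, (1.0, 1.1, 1.2))

print(is_pre_registered(on_time))  # True
print(is_pre_registered(late))     # False
```

The check compares timezone-aware timestamps only. It says nothing about whether the model could see correlated data before the deadline, which is the separate isolation question raised in the editorial analysis further down.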

If this is right

Test-set contamination becomes impossible because predictions precede data arrival.
Evaluation shifts from infrequent static competitions to continuous longitudinal tracking.
Models can be assessed for true generalization on data that did not exist during development.
New entrants can demonstrate performance without waiting for the next large-scale competition cycle.
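One way to picture continuous longitudinal tracking is to keep a per-round error history for each model and report both an all-time average and a recent-window average, so that a newcomer with few rounds can still be compared on recent rounds. This is a sketch under stated assumptions: the function name, the window length, and every error value are invented for illustration and are not results from the paper.

```python
from statistics import mean


def longitudinal_scores(history: dict, window: int = 3) -> dict:
    """Summarize per-round errors (lower is better) for each model.

    history maps model_id -> list of per-round errors in chronological order.
    """
    summary = {}
    for model_id, errors in history.items():
        if not errors:
            continue  # a model with no scored rounds has nothing to report
        summary[model_id] = {
            "rounds": len(errors),
            "all_time": mean(errors),
            "recent": mean(errors[-window:]),
        }
    return summary


# Invented numbers, purely illustrative.
history = {
    "established": [1.0, 0.8, 0.9, 0.7, 0.8, 0.6],
    "newcomer": [0.5, 0.7],   # joined two rounds ago
}
summary = longitudinal_scores(history)
for model_id, stats in summary.items():
    print(model_id, stats)
```

Reporting the round count next to the averages keeps a short track record visibly distinct from a long one.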

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pre-registration idea could apply to live evaluation in other sequential prediction tasks such as reinforcement learning.
Sustaining the platform will depend on securing long-term public or private data feeds that cannot be exhausted.
Continuous live benchmarks may reduce reliance on fixed historical test sets across machine learning domains.
Operational reliability of container orchestration becomes a new evaluation criterion alongside predictive accuracy.

Load-bearing premise

Containerized model submissions run reliably on live streams without introducing new leakage or failures, and enough future data sources will stay available over time.

What would settle it

A documented case where a submitted model receives or uses data after its prediction deadline, or repeated execution failures when running containers on new live streams.

Figures

Figures reproduced from arXiv: 2512.20761 by Henrik Albers, Kevin Zalipski, Marcel Meyer, Oliver Müller, Sascha Kaltenpoth.

**Figure 2.** The Microservice Architecture.

§3.3 API Portal, §3.3.1 Orchestration by Challenges (excerpt): As mentioned, we adapt the concept of competitions [6, 27] and convert it into more fast-paced challenges. A challenge bundles multiple time series and contains many rounds, whose forecasts need to be pre-registered before the same time point t_now and evaluated in an aggregated manner after the actual target values exist. The ti…
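The challenge structure in this excerpt (rounds that share a deadline and are scored in aggregate once the actuals exist) could be modeled roughly as follows. This is a minimal sketch, not the platform's data model: `Round`, `register`, and `aggregate_scores` are hypothetical names, and mean absolute error stands in for whatever metric the platform applies.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from statistics import mean
from typing import Optional


@dataclass
class Round:
    """One forecasting round for one series (hypothetical structure)."""
    series_id: str
    deadline: datetime                              # the shared t_now
    forecasts: dict = field(default_factory=dict)   # model_id -> list of floats
    actuals: Optional[list] = None                  # filled once truth exists

    def register(self, model_id: str, values: list, submitted_at: datetime) -> bool:
        """Store a forecast only if it arrives before the deadline."""
        if submitted_at >= self.deadline:
            return False
        self.forecasts[model_id] = list(values)
        return True


def mean_abs_error(pred: list, truth: list) -> float:
    return mean(abs(p - t) for p, t in zip(pred, truth))


def aggregate_scores(rounds: list) -> dict:
    """Average each model's per-round error over rounds with known actuals."""
    per_model: dict = {}
    for rnd in rounds:
        if rnd.actuals is None:   # ground truth has not arrived yet
            continue
        for model_id, pred in rnd.forecasts.items():
            per_model.setdefault(model_id, []).append(mean_abs_error(pred, rnd.actuals))
    return {m: mean(errs) for m, errs in per_model.items()}


t_now = datetime(2025, 12, 1, tzinfo=timezone.utc)
early = datetime(2025, 11, 30, tzinfo=timezone.utc)
r1, r2 = Round("coal", t_now), Round("gas", t_now)

r1.register("model-a", [10.0, 12.0], early)
r2.register("model-a", [5.0, 5.0], early)
accepted = r2.register("model-b", [5.0, 6.0], t_now)   # at the deadline: rejected

r1.actuals = [11.0, 12.0]   # actuals arrive only after the deadline
r2.actuals = [5.0, 7.0]
scores = aggregate_scores([r1, r2])
print(accepted, scores)
```

Rounds without actuals are skipped rather than scored as zero, so a model's aggregate only reflects rounds that can already be evaluated.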

**Figure 3.** TS-Arena Challenge and Model View.

§3.5 Reference Model Service (excerpt): The Reference Model Service ensures an active participation of a number of reference models in every challenge and round that serve as common baselines. The Reference Model Service interacts with the public TS-Arena API Portal just like any other participant, ensuring the reference models operate under the exact same constraints (e.g., submissio…

**Figure 4.** Example of a hard coal challenge with high average MASE.

**Figure 5.** Example of a natural gas challenge with high average MASE.

**Figure 6.** Example of a complete hard coal time series with many high average MASE rounds.

**Figure 7.** Example of a complete natural gas time series with many high average MASE rounds.

C Results Snapshot 2025-12-31 (excerpt): This section provides a detailed breakdown of the benchmark results as of December 31, 2025. While the primary objective of TS-Arena is to facilitate a continuous and evolving evaluation of TSFMs, we present this snapshot to offer…
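The figure captions report average MASE (mean absolute scaled error), the measure introduced in reference [16]. For readers unfamiliar with it, here is a small sketch of the standard definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a (seasonal) naive forecast. Whether TS-Arena uses a seasonal lag, and how it averages across series and rounds, is not stated in this summary, so the `season` parameter and the toy numbers are assumptions.

```python
def mase(actual, forecast, history, season=1):
    """Mean absolute scaled error: forecast MAE / in-sample naive MAE."""
    if len(actual) == 0 or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equally long")
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal period")

    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # Errors of the naive forecast y_hat[i] = y[i - season] on the history.
    naive_errors = [abs(history[i] - history[i - season])
                    for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("history is constant at this lag; MASE is undefined")
    return forecast_mae / scale


history = [10.0, 12.0, 11.0, 13.0]   # naive one-step errors: 2, 1, 2
score = mase(actual=[14.0, 15.0], forecast=[13.0, 13.0], history=history)
print(score)   # about 0.9
```

A value below 1 means the forecast beat the naive baseline on this scale, and a value above 1 means it did worse.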

Original abstract

Time Series Foundation Models (TSFMs) are transforming the field of forecasting. However, evaluating them on historical data is increasingly difficult due to the risks of train-test sample overlaps and temporal overlaps between correlated train and test time series. To address this, we introduce TS-Arena, a live forecasting platform that shifts evaluation from the known past to the unknown future. Building on the concept of continuous benchmarking, TS-Arena evaluates models on future data. Crucially, we introduce a strict forecasting pre-registration protocol: models must submit predictions before the ground-truth data physically exists. This makes test-set contamination impossible by design. The platform relies on a modular microservice architecture that harmonizes and structures data from different sources and orchestrates containerized model submissions. By enforcing a strict pre-registration protocol on live data streams, TS-Arena prevents information leakage offers a faster alternative to traditional static, infrequently repeated competitions (e.g. the M-Competitions). First empirical results derived from operating TS-Arena over one year of energy time series demonstrate that established TSFMs accumulate robust longitudinal scores over time, while the continuous nature of the benchmark simultaneously allows newcomers to demonstrate immediate competitiveness. TS-Arena provides the necessary infrastructure to assess the true generalization capabilities of modern forecasting models. The platform and corresponding code are available at https://ts-arena.live/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TS-Arena's live pre-registration setup targets leakage in time series benchmarks, but the enforcement details are still thin.


TS-Arena's core move is to evaluate forecasting models on live future data streams instead of historical sets, with a hard rule that predictions must be submitted before the ground truth data physically exists. That protocol, if it holds, does block the usual contamination routes that plague static benchmarks like the M-Competitions. The paper describes a modular microservice layer that ingests and harmonizes data from multiple sources and then runs containerized submissions in an orchestrated way. They have kept the system running for a year on energy time series, and the results indicate that established foundation models build up stable longitudinal scores while new models can enter and compete immediately. That continuous aspect is a practical improvement over infrequent big competitions.

The soft spot is the lack of concrete mechanisms for the timing guarantee. The description covers orchestration and harmonization but does not detail how submission deadlines are enforced, what sandboxing prevents access to correlated live proxies, or how audit logs would catch timing violations. Without those specifics, the claim that leakage is impossible by design rests more on architectural intent than on demonstrated isolation. The one-year empirical section is also high-level, with no error bars, exclusion criteria, or statistical comparisons shown.

This is aimed at the time series forecasting community, especially groups that run or maintain benchmarks and want something more trustworthy than historical hold-outs. It deserves peer review because the problem it addresses is real and the live pre-registration concept is distinct from prior work, though the authors would need to add the missing operational safeguards before the platform claims are fully convincing.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces TS-Arena, a live forecasting platform for evaluating Time Series Foundation Models on future data streams rather than historical data. It centers on a strict pre-registration protocol requiring model submissions before ground-truth data exists, implemented through a modular microservice architecture for data harmonization and containerized model orchestration. One-year empirical results on energy time series are described at a high level to illustrate longitudinal scoring and newcomer competitiveness, with the platform and code made publicly available.

Significance. If the isolation guarantees can be substantiated, the platform would address a genuine and growing problem in TSFM evaluation by enabling continuous, leakage-resistant benchmarking on live streams, providing a faster and more realistic alternative to static competitions such as the M-Competitions. The open release of the platform and code is a clear strength that supports reproducibility and community adoption.

major comments (1)

[Platform Architecture and Orchestration] The central claim that the pre-registration protocol renders test-set contamination impossible by design (abstract and §1) rests on the assumption that the microservice architecture and container orchestration can enforce submissions strictly before any ground-truth data exists. The manuscript provides no details on deadline enforcement, sandboxing to block access to correlated live streams or historical proxies, or audit logs for timing verification, leaving the load-bearing isolation guarantee underspecified.

minor comments (1)

[Abstract] The sentence 'TS-Arena prevents information leakage offers a faster alternative' is missing the conjunction 'and' between 'leakage' and 'offers'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the isolation guarantees of the pre-registration protocol. We address the major comment point-by-point below and commit to expanding the relevant sections in the revised manuscript.


Referee: [Platform Architecture and Orchestration] The central claim that the pre-registration protocol renders test-set contamination impossible by design (abstract and §1) rests on the assumption that the microservice architecture and container orchestration can enforce submissions strictly before any ground-truth data exists. The manuscript provides no details on deadline enforcement, sandboxing to block access to correlated live streams or historical proxies, or audit logs for timing verification, leaving the load-bearing isolation guarantee underspecified.

Authors: We agree that the manuscript currently describes the enforcement mechanisms at a high level and would benefit from explicit details to substantiate the isolation claims. In the revised version we will add a dedicated subsection (new §3.3) that specifies: (1) deadline enforcement via a time-locked submission API that rejects any upload after the pre-defined cutoff (enforced by the orchestration service using synchronized clocks); (2) sandboxing through containerized execution with no outbound network access, read-only volumes, and explicit blocking of any external data sources or historical proxies; and (3) immutable audit logs that record submission timestamps, model container hashes, and verification events, which are publicly queryable. These mechanisms are already implemented in the released codebase; we will include a timeline diagram and pseudocode for the submission protocol to make the guarantees concrete.

revision: yes
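The simulated rebuttal names audit logs that record submission timestamps and container hashes but shows no mechanism. One common way to make such a log tamper-evident is a hash chain, in which each entry's hash also covers the previous entry's hash. The sketch below illustrates that general technique only: `AuditLog` and its record fields are hypothetical and are not taken from the TS-Arena codebase.

```python
import hashlib
import json


def _digest(record: dict, prev_hash: str) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""
    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []   # list of (record, entry_hash) pairs

    def append(self, record: dict) -> str:
        prev = self.entries[-1][1] if self.entries else self.GENESIS
        entry_hash = _digest(record, prev)
        self.entries.append((dict(record), entry_hash))
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited record breaks it."""
        prev = self.GENESIS
        for record, entry_hash in self.entries:
            if _digest(record, prev) != entry_hash:
                return False
            prev = entry_hash
        return True


log = AuditLog()
log.append({"model": "model-a", "container_sha256": "abc123",
            "submitted_at": "2025-12-01T11:59:00Z"})
log.append({"model": "model-b", "container_sha256": "def456",
            "submitted_at": "2025-12-01T11:59:30Z"})
print(log.verify())   # True

# Tampering with a stored timestamp is detected on verification.
log.entries[0][0]["submitted_at"] = "2025-12-01T10:00:00Z"
print(log.verify())   # False
```

A chain like this is tamper-evident rather than immutable: anyone who can rewrite every entry can rebuild the hashes. Publishing the latest hash somewhere outside the operator's control is the usual complement, and whether TS-Arena does so is not stated here.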

Circularity Check

0 steps flagged

No circularity in systems description and protocol


The paper is a systems and protocol description for a live forecasting platform. It contains no mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. The central claim about preventing leakage via pre-registration is presented as a design feature of the microservice architecture and container orchestration, without self-definitional loops, self-citation load-bearing arguments, or renaming of known results. Empirical results from one year of energy data are reported as observations from platform operation, not as outputs of any circular fitting process. This is a standard non-circular systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering and systems paper. No free parameters, mathematical axioms, or invented scientific entities are introduced; the contribution rests on the platform architecture and pre-registration protocol.

pith-pipeline@v0.9.0 · 5547 in / 997 out tokens · 47299 ms · 2026-05-16T19:56:39.805732+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Energy-Arena: A Dynamic Benchmark for Operational Energy Forecasting
econ.EM · 2026-04 · unverdicted · novelty 7.0

Energy-Arena is a dynamic, forward-looking benchmarking platform that standardizes ex-ante submissions and rolling ex-post evaluations for operational energy forecasting to improve transparency and comparability.

FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
cs.LG · 2026-04 · unverdicted · novelty 6.0

Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1] Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. 2024. GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation. doi:10.48550/arXiv.2410.10393. arXiv:2410.10393 [cs]
[2] Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, and Michael B... Chronos-2: From Univariate to Universal Forecasting. arXiv:2510.15821 (2025)
[3] Abdul Fatir Ansari, Caner Turkmen, Oleksandr Shchur, and Lorenzo Stella. 2024. Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon. AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/fast-and-accurate-zero-shot-forecasting-with-chronos-bolt-and-autogluon/
[4] Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. 2025. TiRex: Zero-Shot Forecasting Across Long and Short Horizons with Enhanced In-Context Learning. arXiv preprint arXiv:2505.23719 (2025)
[5] Joachim Bertsch, Christian Growitsch, Stefan Lorenczik, and Stephan Nagl. 2016. Flexibility in Europe’s power sector—An additional requirement or an automatic complement? Energy Economics 53 (2016), 118–131
[6] Casper Solheim Bojer and Jens Peder Meldgaard. 2021. Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting 37, 2 (April 2021), 587–603. doi:10.1016/j.ijforecast.2020.07.007
[7] Bundesnetzagentur. [n. d.]. SMARD: Electricity Market Data Platform. https://www.smard.de
[8] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2024. A decoder-only foundation model for time-series forecasting
[9] Vijay Ekambaram, Arindam Jati, Pankaj Dayama, Sumanta Mukherjee, Nam Nguyen, Wesley M Gifford, Chandra Reddy, and Jayant Kalagnanam. 2024. Tiny time mixers (ttms): Fast pre-trained models for enhanced zero/few-shot forecasting of multivariate time series. Advances in Neural Information Processing Systems 37 (2024), 74147–74181
[10] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. 2025. TabArena: A Living Benchmark for Machine Learning on Tabular Data. doi:10.48550/ARXIV.2506.16791. Version Number: 4
[11] Denizalp Goktas, Amy Greenwald, Gerardo Riano-Briceno, Alexandra Magnusson, Alif Abdullah, and Beatriz de Lucio. 2025. TempusBench: An evaluation framework for time-series forecasting. In Recent advances in time series foundation models have we reached the ’BERT moment’? https://openreview.net/forum?id=3fMa060Ag5
[12] Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885 (2024)
[13] Lars Graf, Thomas Ortner, Stanisław Woźniak, and Angeliki Pantazi. 2025. Flowstate: Sampling rate invariant time series forecasting. arXiv preprint arXiv:2508.05287 (2025)
[14] Tao Hong, Pierre Pinson, Yi Wang, Rafal Weron, Dazhi Yang, and Hamidreza Zareipour. 2020. Energy Forecasting: A Review and Outlook. IEEE Open Access Journal of Power and Energy 7 (2020), 376–388. doi:10.1109/OAJPE.2020.3029979
[15] Rob Hyndman and G. Athanasopoulos. 2021. Forecasting: Principles and Practice (3rd ed.). OTexts, Australia
[16] Rob J. Hyndman and Anne B. Koehler. 2006. Another look at measures of forecast accuracy. International Journal of Forecasting 22, 4 (Oct. 2006), 679–688. doi:10.1016/j.ijforecast.2006.03.001
[17] Max Kanter. 2025. gridstatus: Extract data from ISOs and other energy grid sources. https://github.com/gridstatus/gridstatus
[18] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. doi:10.48550/ARXIV.2001.08361. Version Number: 1
[19] Ralph Kimball. 2013. The data warehouse toolkit: the definitive guide to dimensional modeling (3rd ed.). J. Wiley & Sons, place of publication not identified
[20] Steven Klee and Yuntian Xia. 2025. Measuring time series forecast stability for demand planning. In KDD 2025 workshop on AI for supply chain: Today and future. https://openreview.net/forum?id=26zedugY8W
[21] Jesus Lago, Grzegorz Marcjasz, Bart De Schutter, and Rafał Weron. 2021. Forecasting day-ahead electricity prices: A review of state-of-the-art algorithms, best practices and an open-access benchmark. Applied Energy 293 (July 2021), 116983. doi:10.1016/j.apenergy.2021.116983
[22] Zhe Li, Xiangfei Qiu, Peng Chen, Yihang Wang, Hanyin Cheng, Yang Shu, Jilin Hu, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, and Bin Yang. 2025. TSFM-Bench: A Comprehensive and Unified Benchmark of Foundation Models for Time Series Forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). Associa... doi:10.1145/3711896.3737442
[23] Yuxuan Liang, Haomin Wen, Yuqi Nie, Yushan Jiang, Ming Jin, Dongjin Song, Shirui Pan, and Qingsong Wen. 2024. Foundation Models for Time Series Analysis: A Tutorial and Survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, Barcelona Spain, 6555–6565. doi:10.1145/3637528.3671451
[24] Q. Vera Liao and Ziang Xiao. 2023. Rethinking Model Evaluation as Narrowing the Socio-Technical Gap. doi:10.48550/ARXIV.2306.03100. Version Number: 4
[25] Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, and Junnan Li. 2025. Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698 (2025)
[26] Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. 2025. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816 (2025)
[27] Spyros Makridakis, Evangelos Spiliotis, Ross Hollyman, Fotios Petropoulos, Norman Swanson, and Anil Gaba. 2024. The M6 forecasting competition: Bridging the gap between forecasting and investment decisions. International Journal of Forecasting (Nov. 2024), S0169207024001079. doi:10.1016/j.ijforecast.2024.11.002
[28] Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, and Oliver Müller. 2025. Time Series Foundation Models: Benchmarking Challenges and Requirements. doi:10.48550/ARXIV.2510.13654. Version Number: 1
[29] Marcel Meyer, David Zapata Gonzalez, Sascha Kaltenpoth, and Oliver Müller. 2025. Benchmarking Time Series Foundation Models for Short-Term Household Electricity Load Forecasting. IEEE Access 13 (2025), 218141–218153. doi:10.1109/ACCESS.2025.3648056
[30] Fingrid Oyj. [n. d.]. Fingrid Open Data Platform and API. https://data.fingrid.fi/en
[31] Joaquín Amat Rodrigo and Javier Escobar Ortiz. 2024. Data leakage in pre-trained forecasting models. https://cienciadedatos.net/documentos/py63-data-leakage-pre-trained-forecasting-models.html
[32] Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, and Chenghao Liu. 2025. VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Vision Backbones. arXiv preprint arXiv:2508.04379 (2025)
[33] Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. 2024. Time-moe: Billion-scale time series foundation models with mixture of experts. arXiv preprint arXiv:2409.16040 (2024)
[34] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2024. Unified training of universal time series forecasting transformers. (2024)
[35] Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, and Qiang Xu. 2025. Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting. doi:10.48550/ARXIV.2509.24789. Version Number: 3
[36] Qingren Yao, Chao-Han Huck Yang, Renhe Jiang, Yuxuan Liang, Ming Jin, and Shirui Pan. 2025. Towards neural scaling laws for time series foundation models. In The thirteenth international conference on learning representations. https://openreview.net/forum?id=uCqxDfLYrB
[37] Xu Zhang, Zhengang Huang, Yunzhi Wu, Xun Lu, Erpeng Qi, Yunkai Chen, Zhongya Xue, Qitong Wang, Peng Wang, and Wei Wang. 2025. Multi-period Learning for Financial Time Series Forecasting. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1. ACM, Toronto ON Canada, 2848–2859. doi:10.1145/3690624.3709422