Pith · machine review for the scientific record

arXiv:2605.06530 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

SpatialEpiBench: Benchmarking Spatial Information and Epidemic Priors in Forecasting

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 09:35 UTC · model grok-4.3

classification 💻 cs.AI
keywords epidemic forecasting · spatiotemporal models · benchmark · spatial information · public health · forecast evaluation · outbreak prediction · persistence baseline

The pith

A new benchmark shows most spatiotemporal epidemic models underperform a simple last-value baseline even with priors and during outbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialEpiBench to test whether spatial information and epidemic priors improve forecasts in realistic public health settings. It combines 11 epidemic datasets with rolling evaluations that mimic real-time conditions and metrics focused on outbreak periods. The evaluations find that models informed by geographic adjacency and standard epidemic priors mostly fail to beat a basic persistence model that simply repeats the most recent observation. This shortfall holds across forecast horizons from one day to one month and persists even when outbreaks are active. The work identifies three recurring problems: inability to anticipate new outbreaks, trouble with sparse, noisy data, and weak value from standard geographic adjacency as a representation of epidemiologically relevant spatial structure.
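To make the comparator concrete: the persistence baseline repeats the last observed value for every step of the horizon, and the rolling protocol re-scores at each forecast origin using only past data. A minimal sketch of both (illustrative, not the benchmark's released code; `min_train` and the function names are assumptions):

```python
import numpy as np

def persistence_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    """Last-value baseline: repeat the most recent observation for each step."""
    return np.full(horizon, history[-1])

def rolling_rmse(series: np.ndarray, horizon: int, min_train: int = 52) -> float:
    """Rolling-origin evaluation: at each origin t, forecast t+1..t+horizon
    using only data observed up to t, then score against the held-out truth."""
    sq_errors = []
    for t in range(min_train, len(series) - horizon):
        forecast = persistence_forecast(series[: t + 1], horizon)
        truth = series[t + 1 : t + 1 + horizon]
        sq_errors.append((forecast - truth) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))
```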

Core claim

SpatialEpiBench demonstrates that adjacency-informed forecasting models equipped with widely used epidemic priors mostly underperform the last-value baseline from one day to one month ahead across eleven datasets, even during outbreaks, because of three failure modes: poor anticipation of new outbreaks, difficulty with sparsity and noise, and limited utility of common geographic adjacency for capturing epidemiological spatial information.

What carries the argument

SpatialEpiBench, a collection of eleven epidemic datasets together with standardized rolling evaluations and outbreak-specific metrics designed to assess the added value of spatial adjacency and epidemic priors.

Load-bearing premise

The standardized rolling evaluations and outbreak-specific metrics accurately reflect the challenges of real-time public health forecasting, and the selected models and adjacency representation adequately test the utility of spatial information.

What would settle it

A forecasting method that consistently outperforms the last-value baseline across all eleven datasets, at every horizon from one day to one month, and in both outbreak and non-outbreak periods of SpatialEpiBench would falsify the paper's central finding of underperformance.
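A hedged sketch of what that test could look like, under the assumption that win rate is tallied per dataset-horizon cell against the last-value baseline (the array shapes and scores below are invented for illustration):

```python
import numpy as np

def win_rate(model_rmse: np.ndarray, baseline_rmse: np.ndarray) -> float:
    """Fraction of (dataset, horizon) cells where the model beats the baseline;
    settling the claim would require this to sit decisively above 0.5 in
    outbreak as well as non-outbreak periods."""
    return float(np.mean(model_rmse < baseline_rmse))

# Hypothetical scores: rows = 11 datasets, columns = horizons (1 day .. 1 month).
rng = np.random.default_rng(0)
baseline = rng.uniform(1.0, 2.0, size=(11, 5))
model = baseline * rng.uniform(0.9, 1.2, size=(11, 5))  # illustrative only
print(f"win rate vs last-value baseline: {win_rate(model, baseline):.2f}")
```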

Figures

Figures reproduced from arXiv:2605.06530 by Alistair Turcan, Bryan Wilder, Ruiqi Lyu.

Figure 1. Main results. (a) Win rate of each method versus the naive baseline (higher is better); the dashed line at 0.5 marks winning 50% of the time. (b) Overall RMSE of each method relative to baseline (lower is better); the dashed line at 1.0 marks equal error to baseline. (c–d) RMSE of each method at multiple horizons in (c) DVcli and (d) ILI2019. 95% CIs are calculated for each plot by bootstrapping across months.
Figure 2. Outbreak results. (a) Overall RMSE of each method relative to baseline (lower is better); the dashed line at 1.0 marks equal error to baseline. (b) Additional RMSE relative to baseline in outbreak periods versus all periods for each method. 95% CIs are calculated by bootstrapping across months and meta-analyzing across horizons for (a) and datasets for (b).
Figure 3. Epidemic prior performance. (a) RMSE for each method–patch–dataset combination relative to the unpatched method. (b) RMSE relative to the naive baseline for each method with its best patch configuration; lower is better, and the dashed line at 1.0 marks equal error to baseline. 95% CIs are computed by bootstrapping across months and meta-analyzing across horizons.
Figure 4. Failure modes. (a) Average signed error during outbreak and non-outbreak periods. (b) Filtered RMSE versus RMSE, relative to the naive baseline. (c) A representative outbreak with method predictions at a 7-day horizon. (d) Filtered RMSE for each method with its best patch at each horizon in JHUCase; lower is better. For all plots, 95% CIs are computed by bootstrapping across months; for (a–b), estimates are meta-analyzed.
Figure 5. Learning a new adjacency-informed model. (a) RMSE relative to baseline of each method with its best possible patch configuration (lower is better); the dashed line at 1.0 marks equal error to baseline. "Best graph-based" refers to the top spatial model in a dataset. (b) Optimization trajectory of TusoAI; the three discoveries with the largest validation performance gains are annotated with a summary of the resulting change.
Figures 6–12. Error of each method (appendix plots). 95% CIs are calculated for each plot by bootstrapping across months.
Figures 13–23. Error of each method at multiple horizons by each metric, per dataset: AUcase (Figures 13–14), CAcase (15), CANPositivity (16), CHNGinpatient (17), CHNGoutpatient (18), CPRadmissions (19), DVcli (20), HHShosp (21), ILINet2019 (22), JHUcase (23). 95% CIs are calculated by bootstrapping across months.
Figures 24–25. Win rate of each method versus the naive baseline (higher is better); dashed line at 0.5.
Figure 26. Error of each method. 95% CIs are calculated by bootstrapping across months.
Figure 27. Average error of each method in each dataset for non-outbreak and outbreak periods.
Figure 28. Error of each method. 95% CIs are calculated by bootstrapping across months.
Figure 29. Win rate of each method versus the naive baseline (higher is better); dashed line at 0.5.
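The captions above repeatedly invoke 95% CIs from bootstrapping across months. A minimal percentile-bootstrap sketch consistent with that description (the resampling unit and `n_boot` are assumptions; the paper may bootstrap and meta-analyze differently):

```python
import numpy as np

def bootstrap_ci(monthly_scores: np.ndarray, n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap over months: resample months with replacement,
    recompute the mean score, and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(monthly_scores), size=(n_boot, len(monthly_scores)))
    means = monthly_scores[idx].mean(axis=1)
    return (float(np.quantile(means, alpha / 2)),
            float(np.quantile(means, 1 - alpha / 2)))
```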
read the original abstract

Accurate epidemic forecasting is crucial for public health response, resource allocation, and outbreak intervention, but remains difficult with sparse, noisy, and highly non-stationary data. Because epidemics unfold across interacting regions, spatiotemporal methods are natural candidates for improving forecasts. Despite growing interest in spatial information, no standardized benchmark exists, and current evaluations often use simple chronological train-test splits that do not reflect real-time forecasting practice. We address this gap with SpatialEpiBench, a challenging benchmark for spatiotemporal epidemic forecasting in realistic public-health settings. SpatialEpiBench includes 11 epidemic datasets with standardized rolling evaluations and outbreak-specific metrics. We evaluate adjacency-informed forecasting models with widely used epidemic priors that adapt general models to epidemiology, but find that most methods underperform a simple last-value baseline from 1 day to 1 month ahead, even during outbreaks and with these priors. We identify three major failure modes: (1) poor outbreak anticipation, (2) difficulty handling sparsity and noise, and (3) limited utility of common geographic adjacency for epidemiological spatial information. We release benchmark data, code, and instructions at https://github.com/Rachel-Lyu/SpatialEpiBench to support development of operationally useful epidemic forecasting models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SpatialEpiBench, a benchmark comprising 11 epidemic datasets with standardized rolling-window evaluations and outbreak-specific metrics. It evaluates adjacency-informed spatiotemporal forecasting models augmented with common epidemic priors and reports that the majority underperform a simple last-value baseline across 1-day to 1-month horizons, including during outbreaks. Three failure modes are identified: poor anticipation of outbreaks, difficulty with sparsity and noise, and limited utility of standard geographic adjacency for capturing epidemiological spatial structure. The benchmark data, code, and instructions are released publicly.

Significance. If the empirical findings prove robust, the work provides a valuable standardized challenge that documents concrete limitations of current spatial and prior-augmented approaches to epidemic forecasting. The public release of data and code is a clear strength that enables reproducibility and community follow-up. The results, if confirmed under more operationally realistic conditions, would usefully redirect attention toward methods that better handle non-stationarity, data revisions, and true spatial epidemiological signals rather than simple geographic adjacency.

major comments (2)
  1. [Abstract and benchmark evaluation description] Abstract and evaluation protocol: The central claim that adjacency-informed models with epidemic priors underperform the last-value baseline 'in realistic public-health settings' rests on rolling evaluations performed on finalized, complete time series. No simulation of reporting lags, backfills, or partial data availability is described, which directly affects whether the observed underperformance and the three failure modes (especially sparsity/noise) would hold under the operational conditions the benchmark purports to represent.
  2. [Results] Results section: The abstract states that 'most methods underperform' the baseline, yet no statistical significance tests, confidence intervals, or details on hyperparameter selection and exact train/validation/test splits are provided. Without these, it is impossible to determine whether the reported gaps are reliable or could be explained by optimization differences or particular data partitions.
minor comments (2)
  1. [Discussion] The three failure modes are listed but would benefit from quantitative backing (e.g., specific error breakdowns or dataset-level examples) to make the post-hoc diagnosis more falsifiable.
  2. [Methods] Notation for the adjacency matrices and the precise form of the epidemic priors should be defined once in a dedicated subsection rather than scattered across model descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of operational realism and statistical rigor that we will address in revision. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract and benchmark evaluation description] Abstract and evaluation protocol: The central claim that adjacency-informed models with epidemic priors underperform the last-value baseline 'in realistic public-health settings' rests on rolling evaluations performed on finalized, complete time series. No simulation of reporting lags, backfills, or partial data availability is described, which directly affects whether the observed underperformance and the three failure modes (especially sparsity/noise) would hold under the operational conditions the benchmark purports to represent.

    Authors: We agree that our rolling-window evaluations are conducted on finalized, complete time series and do not simulate reporting lags, backfills, or partial data availability. This represents a genuine limitation relative to fully operational public-health conditions. The rolling protocol is intended to approximate real-time forecasting by restricting training data to information available at each forecast origin, but it does not capture data revisions or delayed reporting. We note that the observed underperformance of adjacency-informed models even under these idealized data conditions suggests the identified failure modes (particularly sparsity/noise handling) are likely to be at least as severe in operational settings. In revision we will (1) qualify the abstract language to specify that the benchmark assumes complete data while using rolling windows, (2) add an explicit limitations paragraph discussing the idealized data assumption, and (3) outline planned future extensions that incorporate simulated reporting delays. We do not claim the current results fully replicate operational conditions. revision: partial

  2. Referee: [Results] Results section: The abstract states that 'most methods underperform' the baseline, yet no statistical significance tests, confidence intervals, or details on hyperparameter selection and exact train/validation/test splits are provided. Without these, it is impossible to determine whether the reported gaps are reliable or could be explained by optimization differences or particular data partitions.

    Authors: We accept that the current manuscript lacks statistical significance tests, confidence intervals, and sufficient detail on hyperparameter selection and data splits. These omissions weaken the ability to assess the reliability of the performance gaps. In the revised manuscript we will add (1) paired statistical tests (Wilcoxon signed-rank tests across horizons and datasets) comparing each method against the last-value baseline, (2) bootstrap-derived 95% confidence intervals for all reported metrics, and (3) expanded methods text that fully documents the hyperparameter search ranges, validation strategy, and the precise rolling train/validation/test window definitions for every dataset and forecast horizon. These additions will be placed in the Results and Methods sections and will be accompanied by supplementary tables containing the raw per-dataset, per-horizon scores. revision: yes
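For readers who want the shape of the promised analysis, a sketch of the paired Wilcoxon signed-rank comparison the rebuttal commits to (all scores below are synthetic; the real test would pair per-dataset, per-horizon RMSEs from the benchmark runs):

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic per-(dataset, horizon) RMSEs: 11 datasets x 5 horizons = 55 pairs.
rng = np.random.default_rng(1)
baseline_rmse = rng.uniform(1.0, 2.0, size=55)
method_rmse = baseline_rmse + rng.normal(0.05, 0.1, size=55)  # illustrative

# Paired one-sided test: is the method's error systematically larger
# than the last-value baseline's?
stat, p = wilcoxon(method_rmse, baseline_rmse, alternative="greater")
print(f"Wilcoxon signed-rank statistic={stat:.1f}, p={p:.4f}")
```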

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct comparisons

full rationale

This is an empirical benchmarking paper that evaluates existing spatiotemporal models against a simple last-value baseline on 11 epidemic datasets using standardized rolling evaluations and outbreak-specific metrics. The abstract and description contain no derivations, first-principles predictions, fitted parameters renamed as forecasts, or self-citation chains that justify a mathematical claim. All reported findings are direct empirical results from model runs on external data, with the last-value baseline serving as an independent comparator. No load-bearing step reduces to its own inputs by construction, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central empirical claims rest on the validity of the chosen baseline, the representativeness of the 11 datasets, and the assumption that geographic adjacency captures epidemiologically relevant spatial structure. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Rolling chronological train-test splits reflect real-time forecasting practice
    Invoked to justify the evaluation protocol as realistic.
  • domain assumption Geographic adjacency provides relevant spatial information for epidemic spread
    Tested via adjacency-informed models; the paper concludes it has limited utility.
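The second axiom is easy to state concretely. A toy construction of the kind of binary, row-normalized border adjacency that graph-based forecasters consume (a sketch assuming a simple shares-a-border definition; the benchmark's datasets define adjacency over their own region sets):

```python
import numpy as np

# Toy 4-region line geography: region i borders regions i-1 and i+1.
n = 4
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# Row-normalized adjacency, the usual input to graph-based forecasters.
# The paper's conclusion is that this border structure carries limited
# epidemiological signal.
A_norm = A / np.clip(A.sum(axis=1, keepdims=True), 1e-12, None)
print(A_norm)
```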

pith-pipeline@v0.9.0 · 5515 in / 1504 out tokens · 58969 ms · 2026-05-08T09:35:20.989574+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1] Estee Y. Cramer et al. Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proceedings of the National Academy of Sciences, 119(15):e2113561119, 2022. doi:10.1073/pnas.2113561119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2113561119
  2. [2] Caitlin Rivers, Jean-Paul Chretien, Steven Riley, Julie A Pavlin, Alexandra Woodward, David Brett-Major, Irina Maljkovic Berry, Lindsay Morton, Richard G Jarman, Matthew Biggerstaff, et al. Using "outbreak science" to strengthen the use of models during epidemics. Nature Communications, 10(1):3102, 2019.
  3. [3] Kris V Parag, Christl A Donnelly, and Alexander E Zarebski. Quantifying the information in noisy epidemic curves. Nature Computational Science, 2(9):584–594, 2022.
  4. [4] Alexander Rodriguez, Harshavardhan Kamarthi, Pulak Agarwal, Javen Ho, Mira Patel, Suchet Sapre, and B Aditya Prakash. Machine learning for data-centric epidemic forecasting. Nature Machine Intelligence, 6(10):1122–1131, 2024.
  5. [5] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in Neural Information Processing Systems, 33:17804–17815, 2020.
  6. [6] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
  7. [7] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, Xiaojun Chang, and Chengqi Zhang. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 753–763, 2020.
  8. [8] Zewen Liu, Guancheng Wan, B. Aditya Prakash, Max S. Y. Lau, and Wei Jin. A review of graph neural networks in epidemic modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), pages 6577–6587, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3637528.3671455.
  9. [9] Mrinmoy Dey, Aprameyo Chakrabartty, Dhruv Sarkar, and Tanujit Chakraborty. Do we really need foundation models for multi-step-ahead epidemic forecasting? In NeurIPS Workshop on Time Series in the Age of Large Models, 2024.
  10. [10] Madhurima Panja, Ojas Modak, Grace Younes, and Tanujit Chakraborty. Zero-shot forecasting of epidemics. In Recent Advances in Time Series Foundation Models: Have We Reached the 'BERT Moment'?, 2025.
  11. [11] Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis, Ana I Bento, Shaun Truelove, Geoffrey Fox, Madhav Marathe, Harry Hochheiser, and Srini Venkatramanan. IDOBE: Infectious disease outbreak forecasting benchmark ecosystem. arXiv preprint arXiv:2604.18521, 2026.
  12. [12] Songgaojun Deng, Shusen Wang, Huzefa Rangwala, Lijing Wang, and Yue Ning. Cola-GNN: Cross-location attention based graph neural networks for long-term ILI prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 245–254, 2020.
  13. [13] Guancheng Wan, Zewen Liu, Xiaojun Shan, Max SY Lau, B Aditya Prakash, and Wei Jin. EARTH: Epidemiology-aware neural ODE with continuous disease transmission graph. In Forty-second International Conference on Machine Learning, 2025.
  14. [14] Feng Xie, Zhong Zhang, Liang Li, Bin Zhou, and Yusong Tan. EpiGNN: Exploring spatial transmission with graph neural network for regional epidemic forecasting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 469–485. Springer, 2022.
  15. [15] Mutong Liu, Yang Liu, and Jiming Liu. Epidemiology-aware deep learning for infectious disease dynamics prediction. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 4084–4088, 2023.
  16. [16] Ezekiel J Emanuel, Govind Persad, Ross Upshur, Beatriz Thome, Michael Parker, Aaron Glickman, Cathy Zhang, Connor Boyle, Maxwell Smith, and James P Phillips. Fair allocation of scarce medical resources in the time of Covid-19, 2020.
  17. [17] Herbert W Hethcote. The mathematics of infectious diseases. SIAM Review, 42(4):599–653, 2000.
  18. [18] Jonas Dehning, Johannes Zierenberg, F Paul Spitzner, Michael Wibral, Joao Pinheiro Neto, Michael Wilczek, and Viola Priesemann. Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions. Science, 369(6500):eabb9789, 2020.
  19. [19] Taghreed Alghamdi, Khalid Elgazzar, Magdi Bayoumi, Taysseer Sharaf, and Sumit Shah. Forecasting traffic congestion using ARIMA modeling. In 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pages 1227–1232. IEEE, 2019.
  20. [20] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.
  21. [21] Estee Y Cramer, Yuxin Huang, Yijin Wang, Evan L Ray, Matthew Cornell, Johannes Bracher, Andrea Brennen, Alvaro J Castro Rivadeneira, Aaron Gerding, Katie House, et al. The United States COVID-19 Forecast Hub dataset. Scientific Data, 9(1):462, 2022.
  22. [22] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  23. [23] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.
  24. [24] Jinliang Deng, Xiusi Chen, Renhe Jiang, Xuan Song, and Ivor W Tsang. ST-Norm: Spatial and temporal normalization for multi-variate time series forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 269–278, 2021.
  25. [25] Chao Shang, Jie Chen, and Jinbo Bi. Discrete graph structure learning for forecasting multiple time series. arXiv preprint arXiv:2101.06861, 2021.
  26. [26] Defu Cao, Yujing Wang, Juanyong Duan, Ce Zhang, Xia Zhu, Congrui Huang, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, et al. Spectral temporal graph neural network for multivariate time-series forecasting. Advances in Neural Information Processing Systems, 33:17766–17778, 2020.
  27. [27] Yubo Liang, Zezhi Shao, Fei Wang, Zhao Zhang, Tao Sun, and Yongjun Xu. BasicTS: An open source fair multivariate time series prediction benchmark. In International Symposium on Benchmarking, Measuring and Optimization, pages 87–101. Springer, 2022.
  28. [28] Lijing Wang, Aniruddha Adiga, Jiangzhuo Chen, Adam Sadilek, Srinivasan Venkatramanan, and Madhav Marathe. CausalGNN: Causal-based graph neural networks for spatio-temporal epidemic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 12191–12199, 2022.
  29. [29] Junyi Gao, Rakshith Sharma, Cheng Qian, Lucas M Glass, Jeffrey Spaeder, Justin Romberg, Jimeng Sun, and Cao Xiao. STAN: Spatio-temporal attention network for pandemic prediction using real-world evidence. Journal of the American Medical Informatics Association, 28(4):733–743, 2021.
  30. [30] Sijie Ruan, Jinyu Li, Jia Wei, Zenghao Xu, Jie Bao, Junshi Xu, Junyang Qiu, Hanning Yuan, Xiaoxiao Wang, and Shuliang Wang. Prior knowledge-enhanced spatio-temporal epidemic forecasting. arXiv preprint arXiv:2602.22270, 2026.
  31. [31] Junyi Gao, Joerg Heintz, Christina Mack, Lucas Glass, Adam Cross, and Jimeng Sun. Evidence-driven spatiotemporal COVID-19 hospitalization prediction with Ising dynamics. Nature Communications, 14(1):3093, 2023.
  32. [32] Zewen Liu, Guancheng Wan, B Aditya Prakash, Max SY Lau, and Wei Jin. A review of graph neural networks in epidemic modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6577–6587, 2024.
  33. [33] Ruiqi Lyu, Alistair Turcan, and Bryan Wilder. Combining digital data streams and epidemic networks for real time outbreak detection. arXiv preprint arXiv:2511.07163, 2025.
  34. [34] David C Farrow, Logan C Brooks, Aaron Rumack, Ryan J Tibshirani, and Roni Rosenfeld. Delphi Epidata API. The Lancet Infectious Diseases, 2015.
  35. [35] Ensheng Dong, Hongru Du, and Lauren Gardner. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5):533–534, 2020.
  36. [36] Centers for Disease Control and Prevention, et al. US influenza surveillance: purpose and methods. National Center for Immunization and Respiratory Diseases (NCIRD), https://www.cdc.gov/flu/weekly/overview.htm, 2024.
  37. [37] Nicholas G Reich, Craig J McGowan, Teresa K Yamana, Abhinav Tushar, Evan L Ray, Dave Osthus, Sasikiran Kandula, Logan C Brooks, Willow Crawford-Crudell, Graham Casey Gibson, et al. Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the US. PLoS Computational Biology, 15(11):e1007486, 2019.
  38. [38] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  39. [39] Aaron Rumack. Modeling Epidemiological Time Series. PhD thesis, Carnegie Mellon University, 2023.
  40. [40] Alexander Rodríguez, Jiaming Cui, Naren Ramakrishnan, Bijaya Adhikari, and B Aditya Prakash. EINNs: Epidemiologically-informed neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 14453–14460, 2023.
  41. [41] Odo Diekmann, Johan Andre Peter Heesterbeek, and Michael G Roberts. The construction of next-generation matrices for compartmental epidemic models. Journal of the Royal Society Interface, 7(47):873–885, 2010.
  42. [42] Alistair Turcan, Kexin Huang, Lei Li, and Martin Jinye Zhang. TusoAI: Agentic optimization for scientific methods. arXiv preprint arXiv:2509.23986, 2025.
  43. [43] Gabrielle Thivierge, Aaron Rumack, and F William Townes. Does spatial information improve forecasting of influenza-like illness? Epidemics, 51:100820, 2025.