Why Model Selection Fails in Time Series Forecasting: An Empirical Study of Instability Across Data Regimes
Pith reviewed 2026-05-09 17:41 UTC · model grok-4.3
The pith
Rule-based model selection in time series forecasting identifies the best model in only a small fraction of cases, and its recommendations fall furthest from the empirically optimal choice in noisy or mixed regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Rule-based model selection achieves low accuracy, with correct model identification occurring in only a small fraction of cases. Significant discrepancies are observed between recommended and empirically optimal models, particularly in noisy and mixed regimes. Model performance is highly sensitive to both dataset characteristics and forecasting horizon, resulting in substantial ranking instability across scenarios. These findings demonstrate that forecasting performance cannot be reliably predicted using static, descriptor-based approaches.
What carries the argument
A descriptor-based framework that measures trend strength, seasonality, noise level, and temporal dependence to create rule-based mappings from data regimes to candidate forecasting models.
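A minimal sketch of how such descriptors might be computed, assuming Python with numpy, pandas, and statsmodels; the paper's exact formulas are not given here, so this uses common conventions (STL-based strength-of-trend and strength-of-seasonality measures in the style of Hyndman and Athanasopoulos, residual variance share for noise, lag-1 autocorrelation for temporal dependence):

```python
# Illustrative descriptor extraction; the definitions below are
# assumptions, not the paper's exact formulas.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def describe_series(y: pd.Series, period: int = 12) -> dict:
    """Compute four illustrative data-regime descriptors for one series."""
    fit = STL(y, period=period).fit()
    resid_var = np.var(fit.resid)
    # Strength of trend/seasonality in [0, 1], per the FPP-style definitions.
    trend_strength = max(0.0, 1.0 - resid_var / np.var(fit.trend + fit.resid))
    seasonal_strength = max(0.0, 1.0 - resid_var / np.var(fit.seasonal + fit.resid))
    return {
        "trend_strength": trend_strength,
        "seasonal_strength": seasonal_strength,
        "noise_level": resid_var / np.var(y),      # residual share of variance
        "temporal_dependence": y.autocorr(lag=1),  # lag-1 autocorrelation
    }
```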
If this is right
- Correct model identification occurs in only a small fraction of cases.
- Discrepancies between recommended and optimal models are largest in noisy and mixed regimes.
- Model rankings change substantially when dataset traits or forecast horizons vary.
- Static descriptor-based rules fail to generalize across different scenarios.
- Forecasting performance cannot be reliably predicted from fixed regime descriptors.
Where Pith is reading between the lines
- Practitioners may achieve better results by empirically testing a small set of models on the target series rather than applying general rules.
- The instability points toward selection methods that adapt during training or across rolling windows.
- Similar descriptor limitations may appear in other predictive tasks where data properties vary within the same domain.
Load-bearing premise
The chosen descriptors of trend strength, seasonality, noise level, and temporal dependence are sufficient to determine which forecasting model will perform best or worst.
What would settle it
Re-application of the same descriptors and rules to a new collection of time series that produces high-accuracy model identification and stable rankings independent of horizon length.
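A minimal sketch of that settling experiment, assuming Python with scipy; `models`, `score` (an out-of-sample error such as rolling-origin MASE), and `rule_recommend` are hypothetical stand-ins, since the paper's rule table and model pool are not reproduced here:

```python
# Hypothetical replication harness: does the rule pick the empirically
# best model, and are model rankings stable across horizons?
import numpy as np
from scipy.stats import spearmanr

def replication_test(series_list, models, horizons, rule_recommend, score):
    hits, total, rank_corrs = 0, 0, []
    for y in series_list:
        errors_by_horizon = []
        for h in horizons:
            errors = {name: score(model, y, h) for name, model in models.items()}
            best = min(errors, key=errors.get)
            hits += int(rule_recommend(y, h) == best)
            total += 1
            errors_by_horizon.append([errors[n] for n in sorted(models)])
        # Ranking stability: Spearman correlation between adjacent horizons
        for a, b in zip(errors_by_horizon, errors_by_horizon[1:]):
            rank_corrs.append(spearmanr(a, b).correlation)
    return hits / total, float(np.mean(rank_corrs))
```

High identification accuracy together with rank correlations near 1 across horizons would count as the stable outcome described above.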
read the original abstract
Time series forecasting models often exhibit inconsistent performance across datasets with varying statistical and structural properties. Despite the wide range of available forecasting techniques, it remains unclear whether model selection can be reliably guided by simple data characteristics. This paper investigates why rule-based model selection fails in time series forecasting by analyzing the relationship between data-regime descriptors and model performance. A descriptor-based framework is introduced to characterize time series using measurable properties, including trend strength, seasonality, noise level, and temporal dependence. Based on these descriptors, a rule-based selection mechanism is formulated to map data regimes to candidate forecasting models. The approach is evaluated on multiple real-world datasets across different domains and forecasting horizons. The results show that rule-based model selection achieves low accuracy, with correct model identification occurring in only a small fraction of cases. Significant discrepancies are observed between recommended and empirically optimal models, particularly in noisy and mixed regimes. Further analysis reveals that model performance is highly sensitive to both dataset characteristics and forecasting horizon, resulting in substantial ranking instability across scenarios. These findings explain why simple heuristic rules fail to generalize and demonstrate that forecasting performance cannot be reliably predicted using static, descriptor-based approaches. This study provides empirical evidence that model selection in time series forecasting is inherently context-dependent and highlights the need for more adaptive, data-driven strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical investigation into the failure of rule-based model selection for time series forecasting. It introduces descriptors for data regimes (trend strength, seasonality, noise level, temporal dependence), proposes a rule-based mapping to forecasting models, and evaluates it on multiple real-world datasets. The key findings are low accuracy in correct model identification, discrepancies with empirically optimal models (especially in noisy/mixed regimes), and high sensitivity to data characteristics and forecasting horizons, leading to the conclusion that static descriptor-based approaches cannot reliably predict performance and that selection is inherently context-dependent.
Significance. If the empirical findings on low accuracy and ranking instability hold after addressing methodological gaps, the paper would contribute a useful observation on the challenges of heuristic model selection in time series forecasting and could motivate more adaptive or data-driven strategies. The work is grounded in real datasets and direct comparisons of rule-recommended vs. empirically best models, but its broader significance is moderated by the limited scope of the tested descriptors and missing statistical controls.
major comments (2)
- [Abstract] The claim that 'forecasting performance cannot be reliably predicted using static, descriptor-based approaches' is not supported by the reported experiments. The study evaluates only the four specific descriptors (trend strength, seasonality, noise level, temporal dependence) and shows failure of the associated rule-based mechanism; it provides no evidence that richer or alternative static descriptor sets (e.g., incorporating changepoint structure or higher-order moments) would also fail to improve identification accuracy.
- [Evaluation] As summarized in the abstract, the reported discrepancies and low accuracy rates lack accompanying statistical significance tests, controls for multiple comparisons across datasets and horizons, and details on how forecasting horizons were selected. Without these, it is unclear whether the observed mismatches between recommended and optimal models reflect systematic instability or could arise from arbitrary choices and random variation.
minor comments (2)
- [Abstract] The evaluation is described as using 'multiple real-world datasets across different domains' but does not name the specific datasets or domains, which hinders assessment of generality and reproducibility.
- [Framework] Provide explicit details on how the rule-based selection mechanism is formulated, including any thresholds, decision trees, or mapping rules that translate the four descriptors into candidate models (a purely hypothetical sketch of such a mapping follows below).
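To make that request concrete, here is a purely hypothetical example of what an explicit mapping could look like; every threshold and model name below is invented for illustration and is not the paper's mechanism:

```python
# Invented rule table over the four descriptors; all thresholds and
# model choices here are assumptions for illustration only.
def recommend_from_descriptors(d: dict) -> str:
    if d["seasonal_strength"] > 0.6:
        return "SARIMA" if d["trend_strength"] > 0.5 else "seasonal-naive"
    if d["trend_strength"] > 0.6:
        return "Holt"   # trend-dominated regime
    if d["noise_level"] > 0.5 and abs(d["temporal_dependence"]) < 0.3:
        return "naive"  # little exploitable structure
    return "ARIMA"      # default for dependent, mixed regimes
```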
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the empirical claims and methodological rigor.
read point-by-point responses
- Referee: [Abstract] The claim that 'forecasting performance cannot be reliably predicted using static, descriptor-based approaches' is not supported by the reported experiments. The study evaluates only the four specific descriptors (trend strength, seasonality, noise level, temporal dependence) and shows failure of the associated rule-based mechanism; it provides no evidence that richer or alternative static descriptor sets (e.g., incorporating changepoint structure or higher-order moments) would also fail to improve identification accuracy.
Authors: We appreciate the referee's observation that our experiments are limited to four specific descriptors. These descriptors were selected because they represent the most commonly employed features in existing rule-based model selection heuristics for time series. Our results show that even these standard descriptors lead to low identification accuracy and substantial ranking instability. While we cannot claim that every conceivable richer descriptor set would fail without additional experiments, the findings highlight inherent difficulties in static descriptor-based prediction. We will revise the abstract to qualify the claim as applying to the tested descriptors and add a dedicated limitations paragraph discussing the potential of alternative or expanded descriptor sets as future work. revision: partial
- Referee: [Evaluation] As summarized in the abstract, the reported discrepancies and low accuracy rates lack accompanying statistical significance tests, controls for multiple comparisons across datasets and horizons, and details on how forecasting horizons were selected. Without these, it is unclear whether the observed mismatches between recommended and optimal models reflect systematic instability or could arise from arbitrary choices and random variation.
Authors: We agree that the current presentation would benefit from greater statistical rigor. In the revised version we will add formal significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing rule-recommended versus empirically optimal models, report effect sizes, and apply Bonferroni correction for multiple comparisons across the 12 datasets and 4 horizons. We will also explicitly document the horizon selection rationale: horizons of 1, 3, 6, and 12 steps were chosen to span short-, medium-, and long-term forecasting scenarios standard in the literature. These additions will clarify that the reported instabilities are systematic. revision: yes
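A minimal sketch of the promised protocol, assuming Python with scipy; the 12-dataset by 4-horizon grid is taken from the rebuttal, while `err_rule` and `err_best` are placeholder arrays of per-window forecast errors for the recommended and empirically best models:

```python
# Paired tests with Bonferroni correction over 12 datasets x 4 horizons.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def compare_cell(err_rule: np.ndarray, err_best: np.ndarray, n_cells: int = 48):
    alpha = 0.05 / n_cells              # Bonferroni across the full grid
    diff = err_rule - err_best
    d = diff.mean() / diff.std(ddof=1)  # paired Cohen's d effect size
    t_p = ttest_rel(err_rule, err_best).pvalue
    w_p = wilcoxon(err_rule, err_best).pvalue
    return {
        "t_p": t_p, "t_significant": t_p < alpha,
        "wilcoxon_p": w_p, "wilcoxon_significant": w_p < alpha,
        "cohens_d": d,
    }
```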
Circularity Check
No circularity: purely empirical comparison of rules to held-out performance
full rationale
The paper defines four data-regime descriptors (trend strength, seasonality, noise level, temporal dependence), formulates a rule-based mapping from those descriptors to candidate models, and then measures the accuracy of that mapping against the empirically best model on real datasets across horizons. No equation or claim reduces a 'prediction' to a fitted parameter by construction; no self-citation supplies a uniqueness theorem or ansatz; and the central result (low identification accuracy) is obtained by direct out-of-sample evaluation rather than by re-expressing the inputs. The derivation chain is therefore self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Time series data can be meaningfully characterized by independent, measurable descriptors such as trend strength, seasonality, noise level, and temporal dependence (a quick check of the independence claim is sketched below).
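A minimal empirical check of the independence part of this assumption, reusing the hypothetical `describe_series` sketched earlier: strong off-diagonal correlations across a corpus would mean the descriptors partly encode the same information.

```python
# Descriptor correlation matrix across a corpus of series; large
# off-diagonal entries would undercut the independence assumption.
import pandas as pd

def descriptor_correlations(series_list, period: int = 12) -> pd.DataFrame:
    rows = [describe_series(y, period=period) for y in series_list]
    return pd.DataFrame(rows).corr(method="spearman")
```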
Reference graph
Works this paper leans on
- [1] M. K. Saravana, M. S. Roopa, J. S. Arunalatha, and K. R. Venugopal, "Transformers for multivariate time series forecasting: Comprehensive analysis, challenges, research opportunities, and future prospects," IEEE Access, vol. 14, pp. 11424–11457, 2026, doi: 10.1109/ACCESS.2026.3654408.
- [2] B. Hassler, F. M. Hoffman, R. Beadling, E. Blockley, B. Huang, J. Lee, V. Lembo, J. Lewis, J. Lu, L. Madaus, E. Malinina et al., "Systematic benchmarking of climate models: Methodologies, applications, and new directions," Reviews of Geophysics, vol. 64, no. 1, 2026, doi: 10.1029/2025RG000891.
- [3] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 Competition: Results, findings, conclusion and way forward," International Journal of Forecasting, vol. 34, no. 4, pp. 802–808, Oct. 2018.
- [4] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice, 2nd ed. Melbourne, Australia: OTexts, 2018.
- [5] S. Mohammed, L. Budach, M. Feuerpfeil, N. Ihde, A. Nathansen, N. Noack, H. Patzlaff, F. Naumann, and H. Harmouch, "The effects of data quality on machine learning performance on tabular data," Information Systems, vol. 132, p. 102549, Jul. 2025, doi: 10.1016/j.is.2025.102549.
- [6] J. D. Hamilton, Time Series Analysis. Princeton, NJ, USA: Princeton University Press, 1994.
- [7] N. Krishnan, "AI agents: Evolution, architecture, and real-world applications," arXiv preprint arXiv:2503.12687, Mar. 2025.
- [8] D. G. Knapen, M. van Kruchten, D. J. A. de Groot, K. E. Broekman, and R. S. N. Fehrmann, "Artificial intelligence for clinical trial design, conduct, and analysis: A narrative review," ESMO Real World Data and Digital Oncology, vol. 11, p. 100682, Mar. 2026, doi: 10.1016/j.esmorw.2026.100682.
- [9] K. Bandara, C. Bergmeir, and S. Smyl, "Forecasting across time series databases using recurrent neural networks on groups of similar series," Expert Systems with Applications, vol. 140, p. 112896, Feb. 2020, doi: 10.1016/j.eswa.2019.112896.
- [10] S. Makridakis et al., "The M5 accuracy competition: Results, findings, and conclusions," International Journal of Forecasting, vol. 38, no. 4, pp. 1341–1364, Oct. 2022.
- [11] C. Bergmeir, R. J. Hyndman, and J. M. Benítez, "A note on the validity of cross-validation for evaluating autoregressive time series prediction," Computational Statistics & Data Analysis, vol. 120, pp. 70–83, Apr. 2018, doi: 10.1016/j.csda.2017.11.003.
- [12] T. Subba Rao, "Canonical factor analysis and stationary time series models," Sankhyā: The Indian Journal of Statistics, Series B, vol. 38, no. 3, pp. 256–271, Aug. 1976.
- [13] X. Wang, K. Smith, and R. J. Hyndman, "Characteristic-based forecasting for time series data," International Journal of Forecasting, vol. 22, no. 2, pp. 217–233, Apr. 2006, doi: 10.1016/j.ijforecast.2005.03.009.
- [14] M. B. Priestley, Spectral Analysis and Time Series. London, U.K.: Academic Press, 1981.
- [15] S. E. Said and D. A. Dickey, "Testing for unit roots in autoregressive-moving average models of unknown order," Biometrika, vol. 71, no. 3, pp. 599–606, Dec. 1984, doi: 10.1093/biomet/71.3.599.
- [16] T. S. Talagala, R. J. Hyndman, and G. Athanasopoulos, "Meta-learning how to forecast time series," International Journal of Forecasting, vol. 34, no. 3, pp. 1–22, 2018, doi: 10.1016/j.ijforecast.2018.01.003.
- [17] S. Mohshini et al., "Evaluation-free time-series forecasting model selection via temporal meta-learning," ACM Transactions on Knowledge Discovery from Data, vol. 18, no. 5, pp. 1–25, 2024.
- [18] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, "Explaining nonlinear classification decisions with deep Taylor decomposition," Pattern Recognition, vol. 65, pp. 211–222, May 2017, doi: 10.1016/j.patcog.2016.11.008.
- [19] J. Lin, E. Keogh, L. Wei, and S. Lonardi, "Experiencing SAX: A novel symbolic representation of time series," Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107–144, Oct. 2007, doi: 10.1007/s10618-007-0064-z.
- [20] C. Faloutsos, R. N. Wright, and Y. Matias, "Forecasting, similarity, and surprises," in Proc. ACM SIGMOD International Conference on Management of Data, 1997, pp. 429–440.
- [21] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and Practice. Melbourne, Australia: OTexts, 2018.
- [22] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 Competition: Results, findings, conclusion and way forward," International Journal of Forecasting, 2018.
- [23] D. Dua and C. Graff, UCI Machine Learning Repository: Electricity Load Diagrams 2011–2014, 2017. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
- [24] UCI Machine Learning Repository, PGCB Hourly Generation Dataset (Bangladesh), 2025. [Online]. Available: https://archive.ics.uci.edu/dataset/1175/pgcb+hourly+generation+dataset+(bangladesh)
- [25] G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control, 5th ed. Hoboken, NJ, USA: Wiley, 2015.
- [26] N. Erickson et al., "AutoGluon-Tabular: Robust and accurate AutoML for structured data," arXiv preprint arXiv:2003.06505, Mar. 2020.