Pretrained Time-Series Foundation Models for Financial Return Forecasting
Pith reviewed 2026-06-26 01:33 UTC · model grok-4.3
The pith
Pretrained time-series foundation models usually rank first in equity return forecasts, yet their gains over a random walk are small and sparse and reach statistical significance in only two cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a conservative rolling-origin protocol with equalized context on AAPL, AMZN, GOOG, JPM and META, pretrained models account for eight of ten task-level wins, with Moirai-2.0 and TimesFM-2.5 posting the best average ranks. The locally trained iTransformer wins both META tasks. Only Chronos on AMZN and Moirai-2.0 on GOOG reject the null of equal or worse accuracy than a random walk at conventional significance levels.
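To make "task-level wins" and "average ranks" concrete, the following is a minimal sketch of how such a summary could be computed from a table of per-task errors. The model names and error values are hypothetical placeholders, not results from the paper, and the paper's own ranking metric and tie handling are not stated here.

```python
import numpy as np

def rank_summary(errors):
    """Map each model to (average rank, number of task wins).

    errors: dict of model name -> list of per-task errors, lower is better.
    Ties are not treated specially in this sketch.
    """
    models = list(errors)
    table = np.array([errors[m] for m in models], dtype=float)  # models x tasks
    # Double argsort turns errors into within-task ranks (1 = lowest error).
    ranks = table.argsort(axis=0).argsort(axis=0) + 1
    wins = (ranks == 1).sum(axis=1)
    return {m: (float(ranks[i].mean()), int(wins[i])) for i, m in enumerate(models)}

# Hypothetical errors for three models on three tasks (illustration only).
scores = {
    "model_a": [0.9, 1.0, 1.2],
    "model_b": [1.0, 0.8, 1.1],
    "model_c": [1.1, 1.2, 1.0],
}
summary = rank_summary(scores)
for name, (avg_rank, n_wins) in summary.items():
    print(name, round(avg_rank, 3), n_wins)
```

In this toy table every model wins one task, yet the average ranks differ, which illustrates why the two summaries are reported separately.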
What carries the argument
Pretraining is treated as an inductive prior that supplies useful attention geometry and PAC-Bayes-style transfer without requiring asset-specific data. The claim is evaluated through equal-context rolling-origin forecasts and Diebold-Mariano tests against random-walk and scratch-trained baselines.
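An equal-context rolling-origin evaluation can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the random-walk benchmark is taken to mean a zero-return forecast (a random walk in prices), the horizon is one step, the loss is squared error, and `context_mean` is a hypothetical stand-in for any model. The data are synthetic.

```python
import numpy as np

def rolling_origin_sq_errors(returns, forecaster, context_len, n_origins):
    """One-step-ahead squared errors over the last n_origins observations.

    At every origin the forecaster sees exactly the context_len most recent
    past returns, so all models get the same information budget.
    """
    returns = np.asarray(returns, dtype=float)
    start = len(returns) - n_origins
    if start < context_len:
        raise ValueError("not enough history for the requested context")
    errors = []
    for t in range(start, len(returns)):
        context = returns[t - context_len:t]  # strictly past data only
        errors.append((returns[t] - forecaster(context)) ** 2)
    return np.array(errors)

def random_walk_benchmark(context):
    # Assumption: a random walk in prices implies a zero expected return.
    return 0.0

def context_mean(context):
    # Hypothetical placeholder model: forecast the mean of the context.
    return float(np.mean(context))

rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.01, size=600)  # synthetic returns, not market data
e_rw = rolling_origin_sq_errors(r, random_walk_benchmark, 64, 200)
e_model = rolling_origin_sq_errors(r, context_mean, 64, 200)
print(e_rw.mean(), e_model.mean())
```

Fixing `context_len` for every forecaster is what the "equalized context" idea amounts to in this sketch; a pretrained model would simply replace `context_mean`.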
If this is right
- Pretrained models lower the data and compute needed to reach competitive forecast accuracy in new assets.
- Local supervised training can still beat generic pretraining on particular stocks.
- Model ranking order does not translate into reliable economic alpha once noise and multiple-testing effects are accounted for.
- Information-theoretic limits on predictability remain binding even for the best-ranked pretrained models.
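The multiple-testing point above can be illustrated with a Holm step-down adjustment over ten task-level tests. This is a sketch only: the p-values are invented for illustration, and nothing here implies that the paper applies this particular correction.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values for ten tasks (NOT taken from the paper).
p = [0.004, 0.03, 0.20, 0.41, 0.08, 0.55, 0.62, 0.12, 0.33, 0.90]
unadjusted = [x <= 0.05 for x in p]
adjusted = holm_reject(p)
print(sum(unadjusted), sum(adjusted))
```

With these made-up inputs, two tests pass at the unadjusted 5% level but only one survives the Holm adjustment, which is the kind of attrition the bullet refers to.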
Where Pith is reading between the lines
- The same cost-saving role could appear in other low-signal domains such as energy demand or macroeconomic nowcasting where labeled data are scarce.
- Adding explicit economic loss functions or position-sizing constraints during fine-tuning might change which models survive the Diebold-Mariano filter.
- The observed pattern suggests that future benchmarks should report both ranking and economic metrics rather than ranking alone.
Load-bearing premise
The five chosen liquid equities together with the equal-context rolling-origin protocol and Diebold-Mariano testing give a representative picture of whether pretrained models can produce economically usable predictability.
What would settle it
A replication that expands to at least twenty equities across multiple sectors and frequencies and finds that at least half the pretrained models pass the Diebold-Mariano test against random walk with positive Sharpe improvement after transaction costs would falsify the central claim.
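One way such a replication might score "Sharpe after transaction costs" is sketched below. Everything here is an assumption for illustration: a sign-following position rule, a linear cost per unit of turnover, 252 trading periods per year, and forecasts aligned with the returns they predict.

```python
import numpy as np

def net_sharpe(forecasts, realized, cost_per_turnover=0.0005, periods_per_year=252):
    """Annualized Sharpe ratio of a sign-following strategy net of linear costs."""
    forecasts = np.asarray(forecasts, dtype=float)
    realized = np.asarray(realized, dtype=float)
    position = np.sign(forecasts)                      # +1 long, -1 short, 0 flat
    turnover = np.abs(np.diff(position, prepend=0.0))  # includes the initial entry
    pnl = position * realized - cost_per_turnover * turnover
    sd = pnl.std(ddof=1)
    if sd == 0.0:
        return 0.0  # no variation in P&L, Sharpe is undefined; report zero
    return float(np.sqrt(periods_per_year) * pnl.mean() / sd)

# Made-up forecasts and returns, for illustration only.
f = [0.002, 0.001, -0.001, 0.003]
y = [0.010, -0.005, -0.002, 0.004]
print(net_sharpe(f, y, cost_per_turnover=0.0))
print(net_sharpe(f, y, cost_per_turnover=0.001))
```

Comparing the model's net Sharpe with that of the benchmark strategy on the same dates would give the "Sharpe improvement" the criterion asks for.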
Original abstract
Financial return forecasting is a difficult test case for time-series foundation models (TSFMs) due to low signal-to-noise ratios, structural breaks, heavy tails, and weak persistence. This paper benchmarks pretrained TSFMs against train-from-scratch neural baselines in a deliberately conservative financial setting. We evaluate TimeGPT/TimeGPT-LH, TimesFM-2.5, Moirai-2.0, Chronos, and Chronos-2 against NBEATS, NHITS, PatchTST, iTransformer, and KAN on five liquid U.S. equities (AAPL, AMZN, GOOG, JPM, META) using linear and log returns. Models are compared under an equalized context budget, a rolling-origin protocol, and against random-walk benchmarks. We provide a theoretical framing of pretraining as an inductive prior, linking PAC-Bayes transfer intuition, information-theoretic predictability limits, and attention geometry. This clarifies why strong model rankings need not imply economically meaningful predictability in noisy markets. Pragmatically, pretrained TSFMs dominate the ranking distribution, accounting for 8 of 10 task-level wins. Moirai-2.0 and TimesFM-2.5 achieve the strongest average ranks, leading tasks for AAPL, JPM, GOOG, and AMZN, while Chronos wins the remaining AMZN task. However, the iTransformer baseline wins both META tasks, showing local supervised learning can still outperform generic pretraining for specific assets. Crucially, gains over the random-walk benchmark are small and sparse. A one-sided Diebold-Mariano test rejects equal or inferior predictive accuracy only for Chronos on AMZN and Moirai-2.0 on GOOG. We conclude that TSFMs serve as useful practical priors that reduce model-development costs in low-data financial forecasting, but are not universal engines for statistically reliable alpha generation in realistic empirical deployment.
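The abstract names two forecasting targets, linear and log returns. A minimal sketch of the standard definitions follows; the price series is made up, and the paper's exact preprocessing (for example, adjustment for dividends or splits) is not described here.

```python
import math

def linear_returns(prices):
    # Simple return: P_t / P_{t-1} - 1
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

def log_returns(prices):
    # Log return: ln(P_t / P_{t-1}); these add up across periods
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

prices = [100.0, 101.0, 99.0, 102.0]  # hypothetical closing prices
lin = linear_returns(prices)
lg = log_returns(prices)
print(lin)
print(lg)
```

The two series are linked by log return = ln(1 + linear return), so they are close for small moves and diverge for large ones, which is one reason a benchmark may report both.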
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks pretrained time-series foundation models (TimeGPT, TimesFM-2.5, Moirai-2.0, Chronos, Chronos-2) against train-from-scratch baselines (NBEATS, NHITS, PatchTST, iTransformer, KAN) for linear and log return forecasting on five liquid U.S. equities. Using an equal-context rolling-origin protocol and one-sided Diebold-Mariano tests against random-walk benchmarks, it reports that TSFMs achieve 8 of 10 task-level ranking wins (led by Moirai-2.0 and TimesFM-2.5) but statistically significant gains over the random walk occur in only two cases (Chronos on AMZN, Moirai-2.0 on GOOG). The paper frames pretraining as an inductive prior and concludes that TSFMs reduce model-development costs in low-signal settings without delivering reliable alpha.
Significance. If the results hold, the work supplies a conservative, protocol-grounded demonstration that TSFMs function as practical priors for financial forecasting while underscoring the distinction between ranking dominance and statistically reliable predictability. The explicit linkage of PAC-Bayes transfer ideas to attention geometry and information-theoretic limits, together with the external random-walk anchor and Diebold-Mariano testing, strengthens the pragmatic takeaway.
minor comments (2)
- The evaluation is restricted to five equities; while the paper correctly labels the setting conservative, a brief discussion of how results might generalize to a broader cross-section (e.g., small-cap or international assets) would clarify scope.
- The manuscript notes the absence of public code artifacts; releasing the rolling-origin evaluation scripts would directly support the reproducibility claim already implicit in the protocol description.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of the manuscript, the assessment of its significance, and the recommendation to accept. The report correctly identifies the core empirical result (ranking dominance with sparse statistical gains over the random walk) and the pragmatic framing of TSFMs as inductive priors.
Circularity Check
No significant circularity detected
The paper presents an empirical benchmarking study of pretrained TSFMs versus baselines on five equities under a rolling-origin protocol with Diebold-Mariano tests against an external random-walk benchmark. No load-bearing derivations, equations, or self-citations reduce the central claims (model rankings and sparse statistical significance) to fitted inputs or prior author results by construction. The theoretical framing on inductive priors is interpretive context rather than a closed derivation loop, and all performance metrics are anchored to independent data and standard external tests.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Diebold-Mariano test assumptions hold for the forecast errors in this financial setting
- domain assumption: Equalized context budget and rolling-origin protocol remove confounding differences in information access across models
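For readers unfamiliar with the test behind the first assumption, here is a minimal sketch of a one-sided Diebold-Mariano test. It assumes squared-error loss, a normal approximation, and a rectangular-kernel long-run variance with h-1 lags; the paper's exact variant (loss function, small-sample correction) is not specified here. The test relies on the loss differential being covariance stationary, which is the assumption the ledger flags.

```python
import math
import numpy as np

def dm_one_sided(e_model, e_bench, h=1):
    """H0: the model is no more accurate than the benchmark.

    Inputs are forecast errors; returns (statistic, p_value).
    A small p-value favors the model.
    """
    d = np.asarray(e_bench, dtype=float) ** 2 - np.asarray(e_model, dtype=float) ** 2
    n = d.size
    d_bar = d.mean()
    dc = d - d_bar
    # Long-run variance of the loss differential using h-1 autocovariances.
    lrv = dc @ dc / n
    for k in range(1, h):
        lrv += 2.0 * (dc[k:] @ dc[:-k]) / n
    if lrv <= 0.0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d_bar / math.sqrt(lrv / n)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))  # P(Z > stat)
    return stat, p_value

rng = np.random.default_rng(1)
bench_err = rng.normal(size=200)   # synthetic benchmark errors
model_err = 0.5 * bench_err        # synthetic model with smaller errors
stat, p = dm_one_sided(model_err, bench_err)
print(stat, p)
```

Swapping the two arguments flips the sign of the statistic, so a model that is worse than the benchmark yields a p-value near one rather than a rejection.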