Toto 2.0: Time Series Forecasting Enters the Scaling Era
Pith reviewed 2026-06-30 18:09 UTC · model grok-4.3
The pith
A single training recipe scales time series forecasting models from 4M to 2.5B parameters with consistent quality gains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.
What carries the argument
The u-muP hyperparameter transfer pipeline that allows the same training recipe to be applied across model sizes without per-scale retuning.
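To make "hyperparameter transfer" concrete, here is a minimal sketch of the standard μP-style width rule for Adam, in which a base learning rate is tuned once on a small proxy model and the learning rate of hidden weight matrices is then scaled down in proportion to width. This is not the u-muP pipeline described in the paper; the function name `mup_adam_lrs`, the base width, and the example widths are all hypothetical.

```python
def mup_adam_lrs(base_lr: float, base_width: int, width: int) -> dict[str, float]:
    """Per-group Adam learning rates under a muP-style width rule (illustrative only)."""
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    ratio = base_width / width
    return {
        # Hidden (matrix-like) weights: learning rate shrinks as 1 / width.
        "hidden_matrices": base_lr * ratio,
        # Vector-like parameters (biases, gains): learning rate left unchanged.
        "vector_like": base_lr,
    }


# Hypothetical widths; the base learning rate is tuned once at width 256.
for width in (256, 1024, 4096):
    print(width, mup_adam_lrs(base_lr=3e-3, base_width=256, width=width))
```

The point of such a rule is that only `base_lr` needs a sweep, and only at the smallest width; whether the released recipe follows this exact parametrization is not stated on this page.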
If this is right
- Larger models in the family will deliver higher forecast accuracy on the same tasks.
- The open weights allow direct fine-tuning or deployment without retraining from scratch.
- Time series tasks can now be approached with the same scaling playbook used in language and vision.
- The training recipe removes the need to redesign hyperparameters for each new model size.
Where Pith is reading between the lines
- The same recipe could be tested on multivariate or high-frequency sensor data to check breadth.
- If scaling continues, task-specific feature engineering may become less central than data volume and model size.
- Practitioners might shift from training many small models to fine-tuning one large base checkpoint.
Load-bearing premise
The three benchmarks measure genuine generalization rather than contamination or overfitting to the training distribution used for Toto 2.0.
What would settle it
The scaling claim would be undermined if performance on a fresh time series dataset, collected after model training and held completely out of all three benchmarks, failed to show continued gains with scale.
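One way to run that check, sketched under stated assumptions: fit a straight line to log(error) against log(parameter count) on the fresh dataset and see whether the slope is clearly negative. Only the 4M and 2.5B endpoints come from the paper; the intermediate sizes and all error values below are synthetic placeholders, and `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(params: list[float], errors: list[float]) -> float:
    """Least-squares slope of log(error) against log(parameter count)."""
    if len(params) != len(errors) or len(params) < 2:
        raise ValueError("need at least two matching (params, error) pairs")
    xs = [math.log(p) for p in params]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("parameter counts must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic example: errors generated from a power law with exponent -0.05.
sizes = [4e6, 2e7, 1e8, 5e8, 2.5e9]  # intermediate sizes are assumptions
errors = [1.00 * (n / 4e6) ** -0.05 for n in sizes]
print(round(loglog_slope(sizes, errors), 3))
```

A slope indistinguishable from zero on genuinely held-out data would be the failure signal described above; a clearly negative slope would support the claim.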
Original abstract
We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that time series foundation models exhibit scaling behavior: a single training recipe yields consistent forecast-quality gains as model size increases from 4M to 2.5B parameters. It introduces the Toto 2.0 family of five open-weights models, reports new state-of-the-art results on the BOOM observability benchmark, the GIFT-Eval general-purpose benchmark, and the contamination-resistant TIME benchmark, and details the architecture, training recipe, data, and u-muP hyperparameter transfer pipeline.
Significance. If the scaling results prove robust to contamination and statistical variation, the work would establish the first clear demonstration of scaling laws in time-series forecasting, analogous to developments in language and vision models. The public release of all five base checkpoints under Apache 2.0 provides a concrete, reproducible artifact that could accelerate follow-on research.
major comments (3)
- [Abstract / experimental results] Abstract and experimental-results section: the SOTA claims on BOOM, GIFT-Eval, and TIME are presented without error bars, standard deviations across seeds, or statistical significance tests. Because the central claim is that forecast quality improves reliably with scale under a fixed recipe, the absence of these controls makes it impossible to distinguish genuine scaling from run-to-run variance.
- [Training data / benchmarks] Training-data and benchmark sections: no decontamination statistics, membership-inference results, or quantitative overlap metrics are supplied between the Toto 2.0 training corpus and the three evaluation sets. The paper notes that TIME is contamination-resistant, yet larger models (up to 2.5B) are precisely those most able to exploit any residual distributional overlap; without explicit verification, the observed scaling could be an artifact of leakage rather than a property of the architecture or recipe.
- [u-muP hyperparameter transfer pipeline] u-muP hyperparameter-transfer pipeline section: the description does not state whether the transferred hyperparameters were re-validated on held-out data after scaling or whether any post-transfer fine-tuning occurred. This detail is load-bearing for the claim that a single recipe suffices across nearly three orders of magnitude in parameter count (4M to 2.5B, a factor of 625).
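The first major comment asks for seed-level variability and significance tests. A minimal sketch of what that could look like, assuming five seeds per model size and a lower-is-better error metric: report mean and sample standard deviation per size, then run an exact one-sided permutation test between two consecutive sizes. The scores are invented for illustration, and the helper names are hypothetical; the paper's own evaluation protocol may differ.

```python
from itertools import combinations
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return mean(scores), stdev(scores)


def exact_permutation_pvalue(small: list[float], large: list[float]) -> float:
    """One-sided exact permutation test that `large` has lower mean error than `small`."""
    pooled = small + large
    n_large = len(large)
    n_small = len(pooled) - n_large
    total = sum(pooled)
    observed = mean(small) - mean(large)
    at_least_as_extreme = 0
    n_splits = 0
    # Enumerate every way of relabeling the pooled scores as "large model" runs.
    for idx in combinations(range(len(pooled)), n_large):
        large_sum = sum(pooled[i] for i in idx)
        diff = (total - large_sum) / n_small - large_sum / n_large
        n_splits += 1
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_splits


# Invented per-seed errors for two consecutive model sizes (lower is better).
small = [0.412, 0.405, 0.418, 0.409, 0.415]
large = [0.391, 0.396, 0.388, 0.393, 0.390]
print(summarize(small), summarize(large))
print(exact_permutation_pvalue(small, large))
```

With five seeds per size the test enumerates 252 relabelings, so the smallest attainable p-value is 1/252; a single run at the largest size, as the rebuttal proposes for 2.5B, cannot be tested this way.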
minor comments (2)
- [Abstract] The abstract refers to “five base checkpoints” but does not list their exact parameter counts or the precise scaling schedule; adding a short table would improve clarity.
- [u-muP hyperparameter transfer pipeline] Notation for the u-muP scaling factors is introduced without an explicit reference to the original μP paper; a citation would help readers unfamiliar with the method.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of experimental rigor in our scaling study. Below we respond to each major comment and indicate the revisions we will make to the manuscript.
-
Referee: [Abstract / experimental results] Abstract and experimental-results section: the SOTA claims on BOOM, GIFT-Eval, and TIME are presented without error bars, standard deviations across seeds, or statistical significance tests. Because the central claim is that forecast quality improves reliably with scale under a fixed recipe, the absence of these controls makes it impossible to distinguish genuine scaling from run-to-run variance.
Authors: We concur that variability measures are essential to substantiate the scaling behavior. The revised manuscript will include error bars and standard deviations computed over multiple random seeds for all model sizes where such runs were performed. For the 2.5B model, we will report the single run but note the computational limitations. Additionally, we will include pairwise statistical significance tests between consecutive model sizes to support the claim of reliable improvements.
revision: partial
-
Referee: [Training data / benchmarks] Training-data and benchmark sections: no decontamination statistics, membership-inference results, or quantitative overlap metrics are supplied between the Toto 2.0 training corpus and the three evaluation sets. The paper notes that TIME is contamination-resistant, yet larger models (up to 2.5B) are precisely those most able to exploit any residual distributional overlap; without explicit verification, the observed scaling could be an artifact of leakage rather than a property of the architecture or recipe.
Authors: We appreciate this concern regarding potential data leakage. The TIME benchmark was specifically constructed to be contamination-resistant, and our scaling results hold on this benchmark, supporting that the improvements are not due to leakage. For BOOM and GIFT-Eval, we will add quantitative metrics such as the percentage of overlapping sequences or n-gram overlap statistics in the revised version. Performing full membership inference on the 2.5B model is resource-intensive, but we will provide these overlap metrics as a practical verification.
revision: partial
-
Referee: [u-muP hyperparameter transfer pipeline] u-muP hyperparameter-transfer pipeline section: the description does not state whether the transferred hyperparameters were re-validated on held-out data after scaling or whether any post-transfer fine-tuning occurred. This detail is load-bearing for the claim that a single recipe suffices across nearly three orders of magnitude in parameter count (4M to 2.5B, a factor of 625).
Authors: The u-muP hyperparameters were re-validated on held-out data following the transfer to each scale, and no post-transfer fine-tuning was conducted; the identical training recipe was applied uniformly. We will revise the relevant section to explicitly document these steps, thereby reinforcing that a single recipe was maintained across all model sizes.
revision: yes
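The rebuttal's proposed "percentage of overlapping sequences" could be computed in many ways. The sketch below shows one simple possibility, assuming exact matching of z-normalized, rounded sliding windows between training and evaluation series. It is a heuristic that will miss near-duplicates and resampled copies, and nothing here is taken from the paper: the helper names, window length, and rounding precision are all hypothetical.

```python
from statistics import mean, pstdev


def window_keys(series: list[float], length: int, decimals: int = 3) -> set[tuple[float, ...]]:
    """Z-normalize every sliding window and round it so it can be hashed."""
    keys = set()
    for start in range(len(series) - length + 1):
        window = series[start:start + length]
        mu, sigma = mean(window), pstdev(window)
        if sigma == 0:
            # Constant windows all map to the same key.
            key = tuple(0.0 for _ in window)
        else:
            key = tuple(round((v - mu) / sigma, decimals) for v in window)
        keys.add(key)
    return keys


def overlap_fraction(train: list[list[float]], evals: list[list[float]], length: int) -> float:
    """Fraction of distinct evaluation windows that also occur in the training set."""
    train_keys: set[tuple[float, ...]] = set()
    for s in train:
        train_keys |= window_keys(s, length)
    eval_keys: set[tuple[float, ...]] = set()
    for s in evals:
        eval_keys |= window_keys(s, length)
    if not eval_keys:
        return 0.0
    return len(eval_keys & train_keys) / len(eval_keys)


# Toy data: the first evaluation series is a rescaled copy of a training pattern.
train = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
evals = [[10.0, 20.0, 30.0, 40.0], [5.0, 1.0, 4.0, 2.0]]
print(overlap_fraction(train, evals, length=3))
```

Normalizing before hashing means a rescaled or shifted copy still counts as overlap, which matters for observability metrics that differ mainly in units; a real audit would likely also need fuzzy matching.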
Circularity Check
No circularity: empirical scaling results are direct observations, not derived by construction
The paper reports an empirical finding that a fixed training recipe yields forecast improvements as model size grows from 4M to 2.5B parameters, validated on three external benchmarks. No equations, derivations, or first-principles claims appear that could reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The u-muP pipeline is presented as a design choice for hyperparameter transfer rather than a uniqueness theorem or ansatz that forces the scaling result. The central claim rests on observable performance deltas, not on any closed logical loop internal to the paper's own definitions or prior self-citations.
Forward citations
Cited by 1 Pith paper
-
Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
Falcon-X introduces a latent prototype space with Unified Prototype Diff-Attention and Latent Entity Attention for heterogeneous multivariate time series forecasting.