pith. sign in

arxiv: 2605.20119 · v2 · pith:T3KMXW7Pnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Toto 2.0: Time Series Forecasting Enters the Scaling Era

Pith reviewed 2026-06-30 18:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecastingfoundation modelsscaling lawsToto 2.0BOOM benchmarkGIFT-EvalTIME benchmarkhyperparameter transfer
0
0 comments X

The pith

A single training recipe scales time series forecasting models from 4M to 2.5B parameters with consistent quality gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that time series foundation models follow scaling behavior similar to other domains. A fixed training recipe produces steady improvements in forecast accuracy as parameter count rises from 4 million to 2.5 billion. Five open-weights models in the Toto 2.0 family are released and achieve new state-of-the-art results on the BOOM, GIFT-Eval, and TIME benchmarks. The work details the architecture, data mixture, and u-muP hyperparameter transfer method used to train the family efficiently. This establishes that larger models trained under the recipe capture more temporal structure without per-size retuning.

Core claim

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark.

What carries the argument

The u-muP hyperparameter transfer pipeline that allows the same training recipe to be applied across model sizes without per-scale retuning.

If this is right

  • Larger models in the family will deliver higher forecast accuracy on the same tasks.
  • The open weights allow direct fine-tuning or deployment without retraining from scratch.
  • Time series tasks can now be approached with the same scaling playbook used in language and vision.
  • The training recipe removes the need to redesign hyperparameters for each new model size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recipe could be tested on multivariate or high-frequency sensor data to check breadth.
  • If scaling continues, task-specific feature engineering may become less central than data volume and model size.
  • Practitioners might shift from training many small models to fine-tuning one large base checkpoint.

Load-bearing premise

The three benchmarks measure genuine generalization rather than contamination or overfitting to the training distribution used for Toto 2.0.

What would settle it

Performance on a fresh time series dataset collected after model training and held completely out of all three benchmarks would fail to show continued gains with scale.

read the original abstract

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that time series foundation models exhibit scaling behavior: a single training recipe yields consistent forecast-quality gains as model size increases from 4M to 2.5B parameters. It introduces the Toto 2.0 family of five open-weights models, reports new state-of-the-art results on the BOOM observability benchmark, the GIFT-Eval general-purpose benchmark, and the contamination-resistant TIME benchmark, and details the architecture, training recipe, data, and u-muP hyperparameter transfer pipeline.

Significance. If the scaling results prove robust to contamination and statistical variation, the work would establish the first clear demonstration of scaling laws in time-series forecasting, analogous to developments in language and vision models. The public release of all five base checkpoints under Apache 2.0 provides a concrete, reproducible artifact that could accelerate follow-on research.

major comments (3)
  1. [Abstract / experimental results] Abstract and experimental-results section: the SOTA claims on BOOM, GIFT-Eval, and TIME are presented without error bars, standard deviations across seeds, or statistical significance tests. Because the central claim is that forecast quality improves reliably with scale under a fixed recipe, the absence of these controls makes it impossible to distinguish genuine scaling from run-to-run variance.
  2. [Training data / benchmarks] Training-data and benchmark sections: no decontamination statistics, membership-inference results, or quantitative overlap metrics are supplied between the Toto 2.0 training corpus and the three evaluation sets. The paper notes that TIME is contamination-resistant, yet larger models (up to 2.5 B) are precisely those most able to exploit any residual distributional overlap; without explicit verification, the observed scaling could be an artifact of leakage rather than a property of the architecture or recipe.
  3. [u-muP hyperparameter transfer pipeline] u-muP hyperparameter-transfer pipeline section: the description does not state whether the transferred hyperparameters were re-validated on held-out data after scaling or whether any post-transfer fine-tuning occurred. This detail is load-bearing for the claim that a single recipe suffices across four orders of magnitude in parameter count.
minor comments (2)
  1. [Abstract] The abstract refers to “five base checkpoints” but does not list their exact parameter counts or the precise scaling schedule; adding a short table would improve clarity.
  2. [u-muP hyperparameter transfer pipeline] Notation for the u-muP scaling factors is introduced without an explicit reference to the original μP paper; a citation would help readers unfamiliar with the method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of experimental rigor in our scaling study. Below we respond to each major comment and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract / experimental results] Abstract and experimental-results section: the SOTA claims on BOOM, GIFT-Eval, and TIME are presented without error bars, standard deviations across seeds, or statistical significance tests. Because the central claim is that forecast quality improves reliably with scale under a fixed recipe, the absence of these controls makes it impossible to distinguish genuine scaling from run-to-run variance.

    Authors: We concur that variability measures are essential to substantiate the scaling behavior. The revised manuscript will include error bars and standard deviations computed over multiple random seeds for all model sizes where such runs were performed. For the 2.5B model, we will report the single run but note the computational limitations. Additionally, we will include pairwise statistical significance tests between consecutive model sizes to support the claim of reliable improvements. revision: partial

  2. Referee: [Training data / benchmarks] Training-data and benchmark sections: no decontamination statistics, membership-inference results, or quantitative overlap metrics are supplied between the Toto 2.0 training corpus and the three evaluation sets. The paper notes that TIME is contamination-resistant, yet larger models (up to 2.5 B) are precisely those most able to exploit any residual distributional overlap; without explicit verification, the observed scaling could be an artifact of leakage rather than a property of the architecture or recipe.

    Authors: We appreciate this concern regarding potential data leakage. The TIME benchmark was specifically constructed to be contamination-resistant, and our scaling results hold on this benchmark, supporting that the improvements are not due to leakage. For BOOM and GIFT-Eval, we will add quantitative metrics such as the percentage of overlapping sequences or n-gram overlap statistics in the revised version. Performing full membership inference on the 2.5B model is resource-intensive, but we will provide these overlap metrics as a practical verification. revision: partial

  3. Referee: [u-muP hyperparameter transfer pipeline] u-muP hyperparameter-transfer pipeline section: the description does not state whether the transferred hyperparameters were re-validated on held-out data after scaling or whether any post-transfer fine-tuning occurred. This detail is load-bearing for the claim that a single recipe suffices across four orders of magnitude in parameter count.

    Authors: The u-muP hyperparameters were re-validated on held-out data following the transfer to each scale, and no post-transfer fine-tuning was conducted; the identical training recipe was applied uniformly. We will revise the relevant section to explicitly document these steps, thereby reinforcing that a single recipe was maintained across all model sizes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical scaling results are direct observations, not derived by construction

full rationale

The paper reports an empirical finding that a fixed training recipe yields forecast improvements as model size grows from 4M to 2.5B parameters, validated on three external benchmarks. No equations, derivations, or first-principles claims appear that could reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The u-muP pipeline is presented as a design choice for hyperparameter transfer rather than a uniqueness theorem or ansatz that forces the scaling result. The central claim rests on observable performance deltas, not on any closed logical loop internal to the paper's own definitions or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5709 in / 1013 out tokens · 20267 ms · 2026-06-30T18:09:29.080661+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

    cs.LG 2026-05 unverdicted novelty 5.0

    Falcon-X introduces a latent prototype space with Unified Prototype Diff-Attention and Latent Entity Attention for heterogeneous multivariate time series forecasting.

Reference graph

Works this paper leans on

28 extracted references · 22 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo

    doi: 10.1145/3292500.3330701. Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. GIFT-Eval: A benchmark for general time series forecasting model evaluation.arXiv preprint arXiv:2410.10393, 2024a. URLhttps://arxiv.org/abs/2410.10393. Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio ...

  2. [2]

    Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Hao Shen, Oleksandr Shchur, Syama S

    URL https://openreview.net/forum?id=yRtgZ1K8hO. Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Hao Shen, Oleksandr Shchur, Syama S. Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and...

  3. [3]

    Chronos: Learning the Language of Time Series

    URL https://arxiv.org/abs/ 2403.07815. Abdul Fatir Ansari, Oleksandr Shchur, Jasper Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama S. Rangapuram, Hao Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Sanyam Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek M. Desai, Hao Wang, Huzefa Rangwala, George Karypis,...

  4. [4]

    Chronos-2: From Univariate to Universal Forecasting

    URLhttps://arxiv.org/abs/2510.15821. Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. TiRex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  5. [6]

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter

    URLhttps://arxiv.org/abs/1705.07774. Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,

  6. [7]

    Workshop at NeurIPS 2025, San Diego

    URL https://berts-workshop.github.io/. Workshop at NeurIPS 2025, San Diego. Charlie Blake, Douglas Orr, and Carlo Luschi. Unit scaling: Out-of-the-box low-precision training. InProceedings of the 40th International Conference on Machine Learning, pages 2548–2576. PMLR,

  7. [8]

    Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, and Wei Wang

    URL https://openreview.net/forum?id= P7KRIiLM8T. Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, and Wei Wang. Time-IMM: A dataset and benchmark for irregular multimodal multivariate time series. InAdvances in Neural Information Processing Systems (NeurIPS 2025 Datasets and Benchmarks T rack),

  8. [9]

    Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, and Qiang Liu

    URLhttps://arxiv.org/abs/2506.10412. Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, and Qiang Liu. Cautious weight decay.arXiv preprint arXiv:2510.12402,

  9. [10]

    Tianqi Chen and Carlos Guestrin

    URLhttps://arxiv.org/abs/2510.12402. Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794,

  10. [11]

    XGBoost: A scalable tree boosting system,

    doi: 10.1145/2939672.2939785. Ben Cohen, Emaad Khwaja, Kan Wang, Clément Masson, Elise Ramé, Youssef Doubli, and Othmane Abou-Amal. Toto: Time series optimized transformer for observability.arXiv preprint arXiv:2407.07874,

  11. [12]

    URL https://arxiv.org/ abs/2407.07874. Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, and Othmane Abou-Amal. This time is dif...

  12. [14]

    A decoder-only foundation model for time-series forecasting

    URLhttps://arxiv.org/abs/2310.10688. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transform- ers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis...

  13. [15]

    doi: 10.18653/v1/N19-1423

    Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URLhttps://aclanthology.org/N19-1423/. Federico Garza, Kin Gutiérrez, Cristian Challu, Jose Moralez, Ricardo Olivares, and Max Mergenthaler. tsfeatures: Calculates various features from time series data. python implementation of the r package tsfeatures,

  14. [16]

    ©Datadog 2026 17 T echnical Report Lars Graf, Thomas Ortner, Stanisław Wo´ zniak, and Angeliki Pantazi

    URLhttps://huggingface.co/google/timesfm-2.5-200m-pytorch. ©Datadog 2026 17 T echnical Report Lars Graf, Thomas Ortner, Stanisław Wo´ zniak, and Angeliki Pantazi. FlowState: Sampling-rate invariant time series foundation model with dynamic forecasting horizons. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  15. [18]

    URLhttps://arxiv.org/abs/2010.04245. Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, and Maosong Sun. MiniCPM: Unveiling the po...

  16. [20]

    Scaling Laws for Neural Language Models

    URLhttps://arxiv.org/abs/2001.08361. Andrej Karpathy. Beating GPT-2 for <<$100: the nanochat journey. GitHub Discussions,

  17. [22]

    Adam: A Method for Stochastic Optimization

    URLhttps://arxiv.org/abs/1412.6980. Roger Koenker and Gilbert Bassett. Regression quantiles.Econometrica, 46(1):33–50,

  18. [24]

    Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, and Junnan Li

    URLhttps://arxiv.org/abs/2510.05491. Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, and Junnan Li. Moirai 2.0: When less is more for time series forecasting.arXiv preprint arXiv:2511.11698, 2025a. URLhttps://arxiv.org/abs/2511.11698. Haoxin Liu, Shangqing Xu, Zhiyuan Zhao, Lingkai Kong, ...

  19. [25]

    Muon is Scalable for LLM Training

    URLhttps://openreview.net/forum?id=Z1TMV4bGuu. Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training.arXiv preprint arXiv:2502.16982, 2025b. URL https://arxiv.org/abs/ 2502.16982. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularizati...

  20. [26]

    doi: https://doi.org/10

    ISSN 0169-2070. doi: https://doi.org/10. 1016/j.ijforecast.2019.04.014. URL https://www.sciencedirect.com/science/article/pii/S0169207019301128. M4 Competition. Pablo Montero-Manso, George Athanasopoulos, Rob J. Hyndman, and Thiyanga S. Talagala. FFORMA: Feature-based forecast model averaging.International Journal of Forecasting, 36(1):86–92,

  21. [27]

    Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, and Frank Hutter

    doi: 10.1016/j.ijforecast.2019.02.011. Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, and Frank Hutter. TempoPFN: Synthetic pre- training of linear RNNs for zero-shot time series forecasting.arXiv preprint arXiv:2510.25502,

  22. [28]

    Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter

    URL https: //arxiv.org/abs/2510.25502. Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do Bayesian inference. InInternational Conference on Learning Representations,

  23. [29]

    ©Datadog 2026 18 T echnical Report Yuqi Nie, Nam H

    URL https://openreview.net/forum? id=KSugKcbNf9. ©Datadog 2026 18 T echnical Report Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. InInternational Conference on Learning Representations,

  24. [30]

    A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

    URL https://arxiv.org/ abs/2211.14730. Zhongzheng Qiao, Sheng Pan, Anni Wang, Viktoriya Zhukova, Yong Liu, Xudong Jiang, Qingsong Wen, Mingsheng Long, Ming Jin, and Chenghao Liu. It’s TIME: Towards the next generation of time series forecasting benchmarks. arXiv preprint arXiv:2602.12147,

  25. [31]

    It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

    URLhttps://arxiv.org/abs/2602.12147. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners,

  26. [33]

    Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance

    URLhttps://arxiv.org/abs/2304.11127. Stephan Xie, Ben Cohen, Mononito Goswami, Junhong Shen, Emaad Khwaja, Chenghao Liu, David Asker, Othmane Abou-Amal, and Ameet Talwalkar. ARFBench: Benchmarking time series question answering ability for software incident response.arXiv preprint arXiv:2604.21199,

  27. [34]

    ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

    URLhttps://arxiv.org/abs/2604.21199. Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, and Qiang Xu. Fidel-TS: A high-fidelity multimodal benchmark for time series forecasting.arXiv preprint arXiv:2509.24789,

  28. [35]

    Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting

    URLhttps://arxiv.org/abs/2509.24789. Greg Yang and Edward J. Hu. Tensor programs IV: Feature learning in infinite-width neural networks. InProceedings of the 38th International Conference on Machine Learning, volume 139, pages 11727–11737. PMLR,