Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models

Jiale Zheng; Jianfeng Zhang; Junchi Yan; Lujia Pan; Ruiying Qi; Yunhao Zhang

arxiv: 2606.09861 · v1 · pith:Q4F3D4ANnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI

Yunhao Zhang, Ruiying Qi, Jiale Zheng, Jianfeng Zhang, Lujia Pan, Junchi Yan

Pith reviewed 2026-06-28 17:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: time series, tokenizer, foundation model, next token prediction, zero-shot forecasting, in-context learning, vector quantization

The pith

A universal tokenizer converts continuous time series into discrete tokens so that an unmodified large language model can perform zero-shot forecasting, generation, and classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniTok, a tokenizer that turns time series data into discrete tokens using a vector-quantized autoencoder. It then pretrains UniTok-FM, built on a standard LLM architecture, using next-token prediction on groups of similar time series. This setup lets one model handle forecasting, generation, and classification in zero-shot or few-shot in-context settings without additional training. The approach aims to unify time series modeling under the language model paradigm by capturing shared dynamics across series.
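As a rough illustration of pretraining on groups of similar series, the sketch below concatenates already-tokenized series into one context window and derives next-token inputs and targets. The special-token ids (`BOS`, `SEP`, `EOS`), the function names, and the packing rule are assumptions made for this example; the page does not give the paper's actual token layout.

```python
from typing import List, Tuple

# Hypothetical special-token ids (assumption, not the paper's vocabulary).
BOS, SEP, EOS = 0, 1, 2


def build_context_window(groups: List[List[int]], max_len: int) -> List[int]:
    """Pack tokenized series with similar patterns into one window."""
    window = [BOS]
    for tokens in groups:
        # Reserve one slot for this series' SEP and one for the final EOS.
        if len(window) + len(tokens) + 2 > max_len:
            break
        window.extend(tokens)
        window.append(SEP)
    window.append(EOS)
    return window


def shift_for_ntp(window: List[int]) -> Tuple[List[int], List[int]]:
    """Next-token prediction: position t is trained to predict token t+1."""
    return window[:-1], window[1:]


if __name__ == "__main__":
    window = build_context_window([[10, 11], [12, 13, 14]], max_len=16)
    inputs, targets = shift_for_ntp(window)
    print(window, inputs, targets)
```

Because later series sit after earlier ones in the same window, a causal model can condition on the earlier series when predicting tokens of the later ones, which is the property in-context inference relies on.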

Core claim

UniTok is a vector-quantized autoencoder with prefix normalization for scale stabilization, a progressive-resolution causal architecture, and a structure-preserving reconstruction loss. UniTok-FM uses an off-the-shelf LLM architecture pretrained via next-token prediction on context windows of multiple series with similar patterns, enabling it to support zero-shot and prompt-boosted forecasting as well as training-free in-context inference for generation and classification.
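One way to read "prefix normalization" is that the scale statistics come only from an initial segment of the series, so appending new observations never changes values that were already normalized. The sketch below assumes a fixed-length prefix and mean/standard-deviation statistics; both choices, and the name `prefix_normalize`, are assumptions for illustration rather than the paper's definition.

```python
import math
from typing import List, Tuple


def prefix_normalize(
    series: List[float], prefix_len: int, eps: float = 1e-8
) -> Tuple[List[float], float, float]:
    """Normalize a series with statistics taken from its first prefix_len points."""
    prefix = series[:prefix_len]
    mean = sum(prefix) / len(prefix)
    var = sum((x - mean) ** 2 for x in prefix) / len(prefix)
    std = math.sqrt(var) + eps  # eps guards against a constant prefix
    return [(x - mean) / std for x in series], mean, std


if __name__ == "__main__":
    short = [1.0, 2.0, 3.0, 4.0]
    extended = short + [100.0, -50.0]
    norm_short, _, _ = prefix_normalize(short, prefix_len=4)
    norm_ext, _, _ = prefix_normalize(extended, prefix_len=4)
    # The normalized prefix is unchanged by the appended observations.
    print(norm_ext[:4] == norm_short)
```

Normalizing with statistics of the whole series would instead rescale the prefix every time data is appended, which conflicts with a token sequence that is only ever extended.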

What carries the argument

UniTok, the universal tokenizer: a vector-quantized autoencoder that maps time series to discrete tokens, stabilizing scale through prefix normalization and preserving structure through its reconstruction loss.
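The quantization step of a vector-quantized autoencoder can be sketched as a nearest-neighbour lookup in a codebook. This is the generic VQ mechanism, not UniTok's specific encoder; the toy codebook and the names `quantize` and `dequantize` are hypothetical.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look the token ids back up; a decoder would consume these vectors."""
    return [list(codebook[t]) for t in tokens]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # toy codebook
    latents = [[0.1, -0.1], [0.9, 1.2], [-0.8, 1.7]]
    tokens = quantize(latents, codebook)
    print(tokens, dequantize(tokens, codebook))
```

The resulting integer ids are what a language model would see as its vocabulary.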

If this is right

A single model can outperform statistical and supervised baselines on forecasting, generation, and classification tasks.
The model achieves competitive performance with task-specific foundation models.
Training-free in-context inference becomes possible across different time series tasks.
Pretraining on grouped similar series captures shared dynamics for general-purpose use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the tokenizer works across domains, it could allow foundation models to handle mixed data types like text and time series in one model.
Extending the context window grouping to more diverse patterns might improve robustness to distribution shifts.
Testing on very long time series or high-frequency data could reveal limits of the progressive-resolution architecture.

Load-bearing premise

Pretraining an unmodified LLM via next-token prediction on context windows of multiple similar time series will capture enough shared dynamics to enable general-purpose performance across forecasting, generation, and classification without task-specific fine-tuning.
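For reference, the next-token prediction objective this premise rests on is an average cross-entropy over positions. The sketch below computes it from raw logits in plain Python; the function name and the toy inputs are illustrative, and nothing here is specific to UniTok-FM.

```python
import math
from typing import List, Sequence


def ntp_loss(logits: Sequence[Sequence[float]], targets: List[int]) -> float:
    """Mean cross-entropy; logits[t] scores the vocabulary for target t."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


if __name__ == "__main__":
    uniform = [[0.0, 0.0, 0.0, 0.0]]
    print(ntp_loss(uniform, [2]))  # equals log(4) for a uniform prediction
```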

What would settle it

A benchmark where UniTok-FM fails to outperform simple statistical baselines in zero-shot forecasting on a held-out dataset with different patterns from the pretraining groups.
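Such a comparison is typically scored with MASE, the metric named in the Figure 5 caption, against a seasonal naive baseline. The sketch below uses the standard definition (forecast MAE divided by the in-sample MAE of the seasonal naive forecast); the helper names and toy numbers are assumptions for illustration.

```python
from typing import List


def seasonal_naive(history: List[float], horizon: int, season: int = 1) -> List[float]:
    """Repeat the last observed season across the forecast horizon."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def mase(history: List[float], actual: List[float],
         forecast: List[float], season: int = 1) -> float:
    """Mean absolute scaled error relative to in-sample seasonal naive error."""
    scale = sum(
        abs(history[t] - history[t - season]) for t in range(season, len(history))
    ) / (len(history) - season)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


if __name__ == "__main__":
    history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    actual = [7.0, 8.0]
    baseline = seasonal_naive(history, horizon=2)
    print(mase(history, actual, baseline))       # baseline score
    print(mase(history, actual, [7.0, 8.0]))     # a perfect forecast scores 0
```

A model that scores above the baseline's MASE on held-out data with unfamiliar patterns would be the failure case described above.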

Figures

Figures reproduced from arXiv: 2606.09861 by Jiale Zheng, Jianfeng Zhang, Junchi Yan, Lujia Pan, Ruiying Qi, Yunhao Zhang.

**Figure 1.** (a) Incremental vs. non-incremental tokenization. Incremental tokenization makes prefix tokens independent of future observations, so appending data extends the token sequence, aligning with the NTP paradigm. Otherwise, incompatible tokens for a prefix and its extension limit generalization from long to short series. (b) Overview of UniTok. The raw TS is decomposed into scale statistics and a normalized se…

**Figure 2.** Progressive-Resolution Causal Autoencoder. Each block applies causal convolution and attention, allowing each latent vector to attend only to the past. At block s, the first 2 s − 1 vectors are preserved, while the remaining are downsampled/upsampled, yielding a progressive-resolution architecture in which earlier tokens with smaller receptive fields receive finer representations.

**Figure 3.** Token arrangement for in-context NTP pretraining and training-free in-context inference. In pretraining, multiple series with similar patterns are concatenated into a context window. In zero-shot forecasting, lookback tokens condition AR generation of future tokens. In prompt-boosted forecasting and few-shot generation, similar-pattern series are prepended as contextual prompts. In few-shot classification,…

**Figure 4.** All prompt examples (red) and sampled generations (blue) of UniTok-FM on Stocks.

**Figure 5.** Scaling behavior across LLM backbone sizes. Qwen3 backbones of three sizes are evaluated: Small (14M), Medium (26M), and Base (129M). (a) Training loss. (b) Forecasting MASE. (c) Generation discriminative score. (d) Classification accuracy.

5.2 Main Results. Zero-Shot & Prompt-Boosted Forecasting. We evaluate forecasting on GIFT-Eval [1], which comprises 97 tasks with different datasets and prediction horizon…

**Figure 6.** Qualitative comparison between the full UniTok and ablated variants. (a) Zero-shot forecasting: full vs. prefix normalization ablated. (b) Series reconstruction: full vs. progressive-resolution causal autoencoder ablated. (c) Series generation: full vs. structure-preserving reconstruction loss ablated, using same prompts as…

**Figure 7.** Zero-shot forecasting efficiency comparison between Chronos and UniTok-FM on Jena Weather. (a) Inference time per instance w.r.t. the number of sampling trajectories. (b) Memory occupancy. (c) Forecasting performance (MASE).

Inference Efficiency. UniTok produces much shorter token sequences than the point-wise binning of Chronos, resulting in improved LLM inference efficiency. We compare Chronos (Base) and UniTok…
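The token-length argument can be made concrete with a small sketch. Point-wise binning emits one token per observation, whereas a compressing tokenizer emits fewer. The binning below is a simplified stand-in (mean-absolute scaling plus uniform bins) and not Chronos's actual implementation; the bin range and the compression ratio of 8 are made-up values for illustration, not figures from the paper.

```python
import math
from typing import List


def pointwise_bin_tokens(series: List[float], num_bins: int = 16,
                         low: float = -5.0, high: float = 5.0) -> List[int]:
    """Simplified point-wise binning: one token id per observation."""
    scale = sum(abs(x) for x in series) / len(series) or 1.0
    width = (high - low) / num_bins
    tokens = []
    for x in series:
        idx = int((x / scale - low) // width)
        tokens.append(min(max(idx, 0), num_bins - 1))  # clip to valid bins
    return tokens


def compressed_token_count(n_points: int, ratio: int) -> int:
    """Token count for a tokenizer that emits one token per `ratio` points."""
    return math.ceil(n_points / ratio)


if __name__ == "__main__":
    series = [math.sin(0.1 * t) for t in range(512)]
    n_pointwise = len(pointwise_bin_tokens(series))
    n_compressed = compressed_token_count(len(series), ratio=8)  # hypothetical ratio
    # Self-attention cost grows roughly with the square of sequence length.
    print(n_pointwise, n_compressed, (n_pointwise / n_compressed) ** 2)
```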

Original abstract

While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To bridge the gap, we introduce UniTok, a universal tokenizer that transforms TS into discrete tokens, and UniTok-FM, a foundation model pretrained via NTP on these tokens. UniTok-FM is a general-purpose foundation model that supports zero-shot and prompt-boosted forecasting, as well as few-shot generation and classification via training-free in-context inference--a capability not achieved by prior works. Technically, UniTok is a vector-quantized autoencoder incorporating prefix normalization for scale stabilization, a progressive-resolution causal architecture for encoding and decoding, and a structure-preserving reconstruction loss for training. UniTok-FM adopts an off-the-shelf LLM architecture without TS-specific modifications. Instead of pretraining on isolated TS, it performs NTP on context windows formed by multiple series with similar patterns, aiming to capture their shared dynamics. Experiments on forecasting, generation, and classification show that a single unified UniTok-FM consistently outperforms statistical and supervised baselines, achieves competitive performance with task-specific foundation models, and uniquely enables training-free in-context inference across tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UniTok-FM applies next-token prediction to time series via a custom tokenizer, but the lack of reported metrics makes the claims hard to evaluate.

This paper introduces UniTok, a tokenizer that converts time series data into discrete tokens, which lets the authors pretrain a foundation model called UniTok-FM using next-token prediction on an unmodified LLM architecture.

The new part is the tokenizer's combination of prefix normalization to stabilize scales, a progressive-resolution causal encoder and decoder, and a structure-preserving reconstruction loss. They also pretrain on context windows made from multiple similar-pattern series rather than treating each series in isolation.

This approach makes sense for handling the continuous and variable nature of time series while trying to capture shared dynamics across series. The technical choices line up with the requirements for causality and structure in the data.
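The causality requirement can be illustrated with a left-padded 1D convolution, where each output depends only on current and past inputs. This is a generic causal convolution written in plain Python, not the paper's encoder block; `causal_conv1d` and the averaging kernel are illustrative choices.

```python
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """y[t] = sum_j kernel[j] * x[t - (k - 1) + j], with zeros before the start."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # pad on the left only
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


if __name__ == "__main__":
    kernel = [0.5, 0.5]
    short = [1.0, 2.0, 3.0, 4.0]
    extended = short + [100.0, 200.0]
    y_short = causal_conv1d(short, kernel)
    y_ext = causal_conv1d(extended, kernel)
    # Appending future values leaves earlier outputs untouched.
    print(y_short, y_ext[: len(short)] == y_short)
```

A symmetric (non-causal) padding would let later observations alter earlier outputs, which would break the prefix-stability that next-token prediction needs.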

The soft spot is that the abstract reports no performance numbers, dataset descriptions, or ablation studies. Without those, the claims about outperforming baselines, matching task-specific models, and enabling training-free in-context inference across tasks cannot be checked from the material reviewed here. The central assumption, that this pretraining leads to general-purpose capabilities, can only be evaluated against the full experimental section.

This kind of work is aimed at people building or using time series foundation models. A reader interested in adapting language modeling techniques to other modalities would get value from the tokenizer details.

It deserves a serious referee because the pipeline is coherent and the idea addresses a real gap, even if the results need verification.

Referee Report

0 major / 3 minor

Summary. The paper introduces UniTok, a VQ-VAE tokenizer for continuous time series that incorporates prefix normalization for scale stabilization, a progressive-resolution causal encoder/decoder architecture, and a structure-preserving reconstruction loss. It then pretrains UniTok-FM, an unmodified off-the-shelf LLM, via next-token prediction on context windows formed by grouping multiple time series that exhibit similar patterns. The central claim is that this yields a single general-purpose foundation model supporting zero-shot and prompt-boosted forecasting as well as training-free in-context few-shot generation and classification, with experiments showing consistent outperformance over statistical and supervised baselines and competitiveness with task-specific foundation models.

Significance. If the empirical results hold, the work would be significant for demonstrating that an unmodified LLM architecture, when paired with a carefully designed universal tokenizer and grouped-pattern pretraining, can deliver cross-task generalization including training-free in-context inference, a capability the authors state has not previously been achieved for time series. Credit is due for the tokenizer components that directly target scale invariance, causality, and structure preservation, and for the pretraining strategy that aims to capture shared dynamics without task-specific modifications.

minor comments (3)

[Abstract] Performance claims are stated without any quantitative metrics, baseline names, or dataset identifiers, which reduces immediate readability even though the full experiments section presumably supplies them.
[§3.2] The structure-preserving loss is described in prose; adding an explicit equation would clarify its distinction from standard VQ-VAE reconstruction terms and aid reproducibility.
[Experiments] Table captions and axis labels in the experimental figures should explicitly state the number of series, context lengths, and whether results are averaged over multiple random seeds.
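
For context on the §3.2 comment, the standard VQ-VAE objective from [46] is written out below as a point of comparison. The symbols ($E$ for the encoder, $D$ for the decoder, $e$ for the selected codebook vector, $\mathrm{sg}$ for stop-gradient, $\beta$ for the commitment weight) follow the usual presentation and are not taken from this paper; the structure-preserving term itself is not specified in the text reviewed here, so it is not reproduced.

```latex
\mathcal{L}_{\text{VQ-VAE}}
  = \underbrace{\lVert x - D(e) \rVert_2^2}_{\text{reconstruction}}
  + \underbrace{\lVert \mathrm{sg}[E(x)] - e \rVert_2^2}_{\text{codebook}}
  + \beta \, \underbrace{\lVert E(x) - \mathrm{sg}[e] \rVert_2^2}_{\text{commitment}}
```

An explicit equation in the paper would show which of these terms the structure-preserving loss replaces or augments.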

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its potential significance, and the recommendation for minor revision. We appreciate the credit given to the tokenizer design and the pretraining strategy.

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with no derivation chain

The paper describes an empirical construction (VQ-VAE tokenizer with prefix norm, progressive causal encoder, structure-preserving loss, followed by unmodified LLM NTP on grouped similar-pattern windows) and reports experimental results on forecasting/generation/classification. No mathematical derivation, equations, or 'first-principles' claims are present in the provided text. All performance claims are benchmark-driven rather than derived from inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked in a load-bearing way. The design choices are presented as direct engineering responses to stated requirements, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level architectural choices; the tokenizer and foundation model are the primary contributions but rest on standard VQ-VAE and LLM assumptions not detailed here.

pith-pipeline@v0.9.1-grok · 5750 in / 1163 out tokens · 20934 ms · 2026-06-28T17:54:22.558976+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 14 canonical work pages · 6 internal anchors

[1]

Gift-eval: A benchmark for general time series forecasting model evaluation

Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024

work page arXiv 2024
[2]

Chronos-2: From Univariate to Universal Forecasting

Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, and Michael B...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Chronos: Learning the language of time series

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research (TMLR), 2024

2024
[4]

Fast and accurate zero-shot forecasting with chronos-bolt and autogluon

Abdul Fatir Ansari, Caner Turkmen, Oleksandr Shchur, and Lorenzo Stella. Fast and accurate zero-shot forecasting with chronos-bolt and autogluon. https://aws.amazon.com/blogs/machine-learning/fast-and-accurate-zero-shot-forecasting-with-chronos-bolt-and-autogluon, 2024

2024
[5]

Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning

Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. In Conference on Neural Information Processing Systems (NeurIPS), 2025

2025
[6]

VisionTS: Visual masked autoencoders are free-lunch zero-shot time series forecasters

Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, and Chenghao Liu. VisionTS: Visual masked autoencoders are free-lunch zero-shot time series forecasters. In International Conference on Machine Learning (ICML), 2025

2025
[7]

Sdformer: Similarity-driven discrete transformer for time series generation

Zhicheng Chen, Shibo Feng, Zhong Zhang, Xi Xiao, Xingyu Gao, and Peilin Zhao. Sdformer: Similarity-driven discrete transformer for time series generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[8]

This time is different: An observability perspective on time series foundation models

Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, et al. This time is different: An observability perspective on time series foundation models. arXiv preprint arXiv:2505.14766, 2025

work page arXiv 2025
[9]

The ucr time series classification archive

Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, and Hexagon-ML. The ucr time series classification archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018, 2018

2018
[10]

A time series forest for classification and feature extraction

Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for classification and feature extraction. Information Sciences, 2013

2013
[11]

Timevae: A variational auto-encoder for multivariate time series generation

Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-encoder for multivariate time series generation. arXiv preprint arXiv:2111.08095, 2021

work page arXiv 2021
[12]

Ideal spatial adaptation by wavelet shrinkage

David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 1994

1994
[13]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2021

2021
[14]

Hdt: Hierarchical discrete transformer for multivariate time series forecasting

Shibo Feng, Peilin Zhao, Liu Liu, Pengcheng Wu, and Zhiqi Shen. Hdt: Hierarchical discrete transformer for multivariate time series forecasting. In AAAI Conference on Artificial Intelligence (AAAI), 2025

2025
[15]

Mantis: Lightweight calibrated foundation model for user-friendly time series classification

Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, and Ievgen Redko. Mantis: Lightweight calibrated foundation model for user-friendly time series classification. 1st ICML Workshop on Foundation Models for Structured Data, 2025

2025
[16]

Units: A unified multi-task time series model

Shanghua Gao, Teddy Koker, Owen Queen, Tom Hartvigsen, Theodoros Tsiligkaridis, and Marinka Zitnik. Units: A unified multi-task time series model. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[17]

Moment: a family of open time-series foundation models

Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: a family of open time-series foundation models. In International Conference on Machine Learning (ICML), 2024

2024
[18]

Random dilated shapelet transform: A new approach for time series shapelets

Antoine Guillaume, Christel Vrain, and Wael Elloumi. Random dilated shapelet transform: A new approach for time series shapelets. In International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI), 2022

2022
[19]

Look into the lite in deep learning for time series classification

Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Look into the lite in deep learning for time series classification. International Journal of Data Science and Analytics, 2025

2025
[20]

Inceptiontime: Finding alexnet for time series classification

Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. Inceptiontime: Finding alexnet for time series classification. Data Mining and Knowledge Discovery, 2020

2020
[21]

Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, and Kun Gai. From principles to applications: A comprehensive survey of discrete tokenizers in generation, comprehension, recommendation, and information retrieval. arXiv preprint arXiv:2502.12448, 2025

work page arXiv 2025
[22]

Time-llm: Time series forecasting by reprogramming large language models

Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. In International Conference on Learning Representations (ICLR), 2024

2024
[23]

Photo-realistic single image super-resolution using a generative adversarial network

Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2017

2017
[24]

Vector quantized time series generation with a bidirectional prior model

Daesoo Lee, Sara Malacarne, and Erlend Aune. Vector quantized time series generation with a bidirectional prior model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

2023
[25]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

2022
[26]

Your diffusion model is secretly a zero-shot classifier

Alexander Cong Li, Mihir Prabhudesai, Shivam Duggal, Ellis Langham Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023

2023
[27]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[28]

itransformer: Inverted transformers are effective for time series forecasting

Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In International Conference on Learning Representations (ICLR), 2024

2024
[29]

Sundial: A family of highly capable time series foundation models

Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Sundial: A family of highly capable time series foundation models. In International Conference on Machine Learning (ICML), 2025

2025
[30]

Timer: Generative pre-trained transformers are large time series models

Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: Generative pre-trained transformers are large time series models. In International Conference on Machine Learning (ICML), 2024

2024
[31]

Finite scalar quantization: VQ-VAE made simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. In International Conference on Learning Representations (ICLR), 2024

2024
[32]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations (ICLR), 2023

2023
[33]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

2019
[34]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020

2020
[35]

Lag-llama: Towards foundation models for time series forecasting

Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-llama: Towards foundation models for time series forecasting. In NeurIPS Workshop R0-FoMo: Robustness o...

2023
[36]

Generating diverse high-fidelity images with vq-vae-2

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Conference on Neural Information Processing Systems (NeurIPS), 2019

2019
[37]

Time-moe: Billion-scale time series foundation models with mixture of experts

Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-moe: Billion-scale time series foundation models with mixture of experts. In International Conference on Learning Representations (ICLR), 2025

2025
[38]

Kronos: A foundation model for the language of financial markets

Yu Shi, Zongliang Fu, Shuo Chen, Bohan Zhao, Wei Xu, Changshui Zhang, and Jian Li. Kronos: A foundation model for the language of financial markets. arXiv preprint arXiv:2508.02739, 2025

work page arXiv 2025
[39]

Xihe: Scalable zero-shot time series learner via hierarchical interleaved block attention

Yinbo Sun, Yuchen Fang, Zhibo Zhu, Jia Li, Yu Liu, Qiwen Deng, Jun Zhou, Hang Yu, Xingyu Lu, and Lintao Ma. Xihe: Scalable zero-shot time series learner via hierarchical interleaved block attention. arXiv preprint arXiv:2510.21795, 2025

work page arXiv 2025
[40]

Totem: Tokenized time series embeddings for general time series analysis

Sabera Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis. Transactions on Machine Learning Research (TMLR), 2024

2024
[41]

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, and Shijin Wang. From values to tokens: An llm-driven framework for context-aware time series forecasting via symbolic discretization. arXiv preprint arXiv:2508.09191, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[44]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Conditional image generation with pixelcnn decoders

Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Conference on Neural Information Processing Systems (NeurIPS), 2016

2016
[46]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Conference on Neural Information Processing Systems (NeurIPS), 2017

2017
[47]

Output scaling: Yinglong-delayed chain of thought in a large pretrained time series forecasting model

Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, and Jingren Zhou. Output scaling: Yinglong-delayed chain of thought in a large pretrained time series forecasting model. arXiv preprint arXiv:2506.11029, 2025

work page arXiv 2025
[48]

Time series classification from scratch with deep neural networks: A strong baseline

Zhiguang Wang, Weizhong Yan, and Tim Oates. Time series classification from scratch with deep neural networks: A strong baseline. In International Joint Conference on Neural Networks (IJCNN), 2017

2017
[49]

Abstracted shapes as tokens-a generalizable and interpretable model for time-series classification

Yunshi Wen, Tengfei Ma, Lily Weng, Lam Nguyen, and Anak Agung Julius. Abstracted shapes as tokens-a generalizable and interpretable model for time-series classification. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[50]

Unified training of universal time series forecasting transformers

Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. Unified training of universal time series forecasting transformers. In International Conference on Machine Learning (ICML), 2024

2024
[51]

Cot-gan: Generating sequential data via causal optimal transport

Tianlin Xu, Li Kevin Wenliang, Michael Munn, and Beatrice Acciaio. Cot-gan: Generating sequential data via causal optimal transport. In Conference on Neural Information Processing Systems (NeurIPS), 2020

2020
[52]

FITS: Modeling time series with $10k$ parameters

Zhijian Xu, Ailing Zeng, and Qiang Xu. FITS: Modeling time series with $10k$ parameters. In International Conference on Learning Representations (ICLR), 2024

2024
[53]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Time-series generative adversarial networks

Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. In Conference on Neural Information Processing Systems (NeurIPS), 2019

2019
[55]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

An image is worth 32 tokens for reconstruction and generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024
[57]

Diffusion-TS: Interpretable diffusion for general time series generation

Xinyu Yuan and Yan Qiao. Diffusion-TS: Interpretable diffusion for general time series generation. In International Conference on Learning Representations (ICLR), 2024

2024
[58]

Are transformers effective for time series forecasting?

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In AAAI Conference on Artificial Intelligence (AAAI), 2023

2023
[59]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2018

2018
[60]

MMPD: Diverse time series forecasting via multi-mode patch diffusion loss

Yunhao Zhang, Wenyao Hu, Jiale Zheng, Lujia Pan, and Junchi Yan. MMPD: Diverse time series forecasting via multi-mode patch diffusion loss. In International Conference on Learning Representations (ICLR), 2026

2026
[61]

Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting

Yunhao Zhang and Junchi Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In International Conference on Learning Representations (ICLR), 2023.

2023

A Design Details and Referenced Methods in UniTok

A.1 Length Mapping Functions

Accounting for the 4 special tokens, 2×8 scale statistic tokens and the...
[62]

GIFT-Pretrain [1]: This is a large-scale corpus released alongside the GIFT-Eval benchmark. A strict split-checking procedure is applied to ensure that no test data from GIFT-Eval appears in the pretraining set, guaranteeing a fully zero-shot evaluation for TSFMs trained on it. The dataset consists of 88 sub-datasets spanning 7 domains and 13 sampling frequ...
[63]

Dataset/Frequency/Prediction Term

Chronos-Dataset [3]: This is the dataset for training and evaluation of Chronos. The original dataset contains 67 subsets spanning 8 domains and is publicly available at https://huggingface.co/datasets/autogluon/chronos_datasets. These two datasets partially overlap. We carefully construct their union and avoid test-set leakage by adding the following ...
[64]

Normalization by Seasonal-Naive: For each task $i$, the raw score $s_i$ is normalized by the corresponding Seasonal-Naive score $s_i^{(\mathrm{season})}$:

$$\tilde{s}_i = \frac{s_i}{s_i^{(\mathrm{season})}} \tag{19}$$

This normalization reflects the performance of the evaluated model relative to the Seasonal-Naive baseline.
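As a rough illustration of this scoring step, the sketch below divides each task's raw score by the Seasonal-Naive score on the same task (Eq. 19) and then combines the results with the geometric mean that the evaluation protocol uses for aggregation (Eq. 20). The function names and the three example scores are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
import math


def seasonal_naive_normalize(scores, baseline_scores):
    """Eq. (19): divide each raw score by the Seasonal-Naive score of the same task."""
    if len(scores) != len(baseline_scores):
        raise ValueError("scores and baseline_scores must have the same length")
    if any(b <= 0 for b in baseline_scores):
        raise ValueError("Seasonal-Naive scores must be positive")
    return [s / b for s, b in zip(scores, baseline_scores)]


def geometric_mean(values):
    """Eq. (20): geometric mean, computed in log space to avoid overflow or underflow."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical raw scores for three tasks (lower is better for error metrics).
model_scores = [0.8, 1.2, 0.5]
naive_scores = [1.6, 1.2, 2.0]

normalized = seasonal_naive_normalize(model_scores, naive_scores)
aggregated = geometric_mean(normalized)
print(normalized)   # per-task ratios relative to Seasonal-Naive
print(aggregated)   # about 0.5 for these example values
```

A ratio below 1 means the model beats Seasonal-Naive on that task. The geometric mean is a natural fit for such ratios because halving the error on one task and doubling it on another cancel out exactly.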
[65]

Dataset / Frequency / Prediction Term

Geometric Mean Aggregation: The final aggregated score is computed as the geometric mean of the normalized scores across all $N = 97$ tasks:

$$s_{\mathrm{agg}} = \Big( \prod_{i=1}^{N} \tilde{s}_i \Big)^{1/N} \tag{20}$$

C.2 Few-Shot Generation Datasets

We adopt four real-world datasets from Diffusion-TS [57] (i.e., Stocks, ETTh, Energy, fMRI) for generation evaluation. For each dataset, only the first channel ...


[1]

Gift-eval: A benchmark for general time series forecasting model evaluation

Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024

arXiv 2024

[2]

Chronos-2: From Univariate to Universal Forecasting

Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, Mononito Goswami, Shubham Kapoor, Danielle C. Maddix, Pablo Guerron, Tony Hu, Junming Yin, Nick Erickson, Prateek Mutalik Desai, Hao Wang, Huzefa Rangwala, George Karypis, Yuyang Wang, and Michael B...

arXiv 2025

[3]

Chronos: Learning the language of time series

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. Chronos: Learning the language of time series. Transactions on Machine Learning Research (TMLR), 2024

2024

[4]

Fast and accurate zero-shot forecasting with chronos-bolt and autogluon

Abdul Fatir Ansari, Caner Turkmen, Oleksandr Shchur, and Lorenzo Stella. Fast and accurate zero-shot forecasting with chronos-bolt and autogluon. https://aws.amazon.com/blogs/machine-learning/fast-and-accurate-zero-shot-forecasting-with-chronos-bolt-and-autogluon, 2024

2024

[5]

Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning

Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. In Conference on Neural Information Processing Systems (NeurIPS), 2025

2025

[6]

VisionTS: Visual masked autoencoders are free-lunch zero-shot time series forecasters

Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, and Chenghao Liu. VisionTS: Visual masked autoencoders are free-lunch zero-shot time series forecasters. In International Conference on Machine Learning (ICML), 2025

2025

[7]

Sdformer: Similarity-driven discrete transformer for time series generation

Zhicheng Chen, Shibo Feng, Zhong Zhang, Xi Xiao, Xingyu Gao, and Peilin Zhao. Sdformer: Similarity-driven discrete transformer for time series generation. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024

[8]

This time is different: An observability perspective on time series foundation models

Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, et al. This time is different: An observability perspective on time series foundation models. arXiv preprint arXiv:2505.14766, 2025

arXiv 2025

[9]

The ucr time series classification archive

Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, and Hexagon-ML. The ucr time series classification archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018, 2018

2018

[10]

A time series forest for classification and feature extraction

Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for classification and feature extraction. Information Sciences, 2013

2013

[11]

Timevae: A variational auto-encoder for multivariate time series generation

Abhyuday Desai, Cynthia Freeman, Zuhui Wang, and Ian Beaver. Timevae: A variational auto-encoder for multivariate time series generation. arXiv preprint arXiv:2111.08095, 2021

arXiv 2021

[12]

Ideal spatial adaptation by wavelet shrinkage

David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 1994

1994

[13]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2021

2021

[14]

Hdt: Hierarchical discrete transformer for multivariate time series forecasting

Shibo Feng, Peilin Zhao, Liu Liu, Pengcheng Wu, and Zhiqi Shen. Hdt: Hierarchical discrete transformer for multivariate time series forecasting. In AAAI Conference on Artificial Intelligence (AAAI), 2025

2025

[15]

Mantis: Lightweight calibrated foundation model for user-friendly time series classification

Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, and Ievgen Redko. Mantis: Lightweight calibrated foundation model for user-friendly time series classification. 1st ICML Workshop on Foundation Models for Structured Data, 2025

2025

[16]

Units: A unified multi-task time series model

Shanghua Gao, Teddy Koker, Owen Queen, Tom Hartvigsen, Theodoros Tsiligkaridis, and Marinka Zitnik. Units: A unified multi-task time series model. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024

[17]

Moment: a family of open time-series foundation models

Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: a family of open time-series foundation models. In International Conference on Machine Learning (ICML), 2024

2024

[18]

Random dilated shapelet transform: A new approach for time series shapelets

Antoine Guillaume, Christel Vrain, and Wael Elloumi. Random dilated shapelet transform: A new approach for time series shapelets. In International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI), 2022

2022

[19]

Look into the lite in deep learning for time series classification

Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. Look into the lite in deep learning for time series classification. International Journal of Data Science and Analytics, 2025

2025

[20]

Inceptiontime: Finding alexnet for time series classification

Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. Inceptiontime: Finding alexnet for time series classification. Data Mining and Knowledge Discovery, 2020

2020

[21]

Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, and Kun Gai. From principles to applications: A comprehensive survey of discrete tokenizers in generation, comprehension, recommendation, and information retrieval. arXiv preprint arXiv:2502.12448, 2025

arXiv 2025

[22]

Time-llm: Time series forecasting by reprogramming large language models

Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. In International Conference on Learning Representations (ICLR), 2024

2024

[23]

Photo-realistic single image super-resolution using a generative adversarial network

Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2017

2017

[24]

Vector quantized time series generation with a bidirectional prior model

Daesoo Lee, Sara Malacarne, and Erlend Aune. Vector quantized time series generation with a bidirectional prior model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023

2023

[25]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022

2022

[26]

Your diffusion model is secretly a zero-shot classifier

Alexander Cong Li, Mihir Prabhudesai, Shivam Duggal, Ellis Langham Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023

2023

[27]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024

[28]

itransformer: Inverted transformers are effective for time series forecasting

Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In International Conference on Learning Representations (ICLR), 2024

2024

[29]

Sundial: A family of highly capable time series foundation models

Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Sundial: A family of highly capable time series foundation models. In International Conference on Machine Learning (ICML), 2025

2025

[30]

Timer: Generative pre-trained transformers are large time series models

Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer: Generative pre-trained transformers are large time series models. In International Conference on Machine Learning (ICML), 2024

2024

[31]

Finite scalar quantization: VQ-VAE made simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. In International Conference on Learning Representations (ICLR), 2024

2024

[32]

A time series is worth 64 words: Long-term forecasting with transformers

Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations (ICLR), 2023

2023

[33]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

2019

[34]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 2020

2020

[35]

Lag-llama: Towards foundation models for time series forecasting

Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-llama: Towards foundation models for time series forecasting. In NeurIPS Workshop R0-FoMo:Robustness o...

2023

[36]

Generating diverse high-fidelity images with vq-vae-2

Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Conference on Neural Information Processing Systems (NeurIPS), 2019

2019

[37]

Time-moe: Billion-scale time series foundation models with mixture of experts

Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. Time-moe: Billion-scale time series foundation models with mixture of experts. In International Conference on Learning Representations (ICLR), 2025

2025

[38]

Kronos: A foundation model for the language of financial markets

Yu Shi, Zongliang Fu, Shuo Chen, Bohan Zhao, Wei Xu, Changshui Zhang, and Jian Li. Kronos: A foundation model for the language of financial markets. arXiv preprint arXiv:2508.02739, 2025

arXiv 2025

[39]

Xihe: Scalable zero-shot time series learner via hierarchical interleaved block attention

Yinbo Sun, Yuchen Fang, Zhibo Zhu, Jia Li, Yu Liu, Qiwen Deng, Jun Zhou, Hang Yu, Xingyu Lu, and Lintao Ma. Xihe: Scalable zero-shot time series learner via hierarchical interleaved block attention. arXiv preprint arXiv:2510.21795, 2025

arXiv 2025

[40]

Totem: Tokenized time series embeddings for general time series analysis

Sabera Talukder, Yisong Yue, and Georgia Gkioxari. Totem: Tokenized time series embeddings for general time series analysis. Transactions on Machine Learning Research (TMLR), 2024

2024

[41]

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, and Shijin Wang. From values to tokens: An llm-driven framework for context-aware time series forecasting via symbolic discretization. arXiv preprint arXiv:2508.09191, 2025

arXiv 2025

[42]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

arXiv 2024

[43]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In Conference on Neural Information Processing Systems (NeurIPS), 2024

2024

[44]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

arXiv 2023

[45]

Conditional image generation with pixelcnn decoders

Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Conference on Neural Information Processing Systems (NeurIPS), 2016

2016

[46]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Conference on Neural Information Processing Systems (NeurIPS), 2017

2017

[47]

Output scaling: Yinglong-delayed chain of thought in a large pretrained time series forecasting model

Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, and Jingren Zhou. Output scaling: Yinglong-delayed chain of thought in a large pretrained time series forecasting model. arXiv preprint arXiv:2506.11029, 2025

arXiv 2025

[48]

Time series classification from scratch with deep neural networks: A strong baseline

Zhiguang Wang, Weizhong Yan, and Tim Oates. Time series classification from scratch with deep neural networks: A strong baseline. In International Joint Conference on Neural Networks (IJCNN), 2017

2017
