pith. machine review for the scientific record.

arxiv: 2403.07815 · v3 · submitted 2024-03-12 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links · Lean Theorem

Chronos: Learning the Language of Time Series

Abdul Fatir Ansari, Andrew Gordon Wilson, Caner Turkmen, Danielle C. Maddix, Hao Wang, Huibin Shen, Jasper Zschiegner, Kari Torkkola, Lorenzo Stella, Michael Bohlke-Schneider, Michael W. Mahoney, Oleksandr Shchur, Pedro Mercado, Sebastian Pineda Arango, Shubham Kapoor, Syama Sundar Rangapuram, Xiyuan Zhang, Yuyang Wang

Pith reviewed 2026-05-13 08:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords time series forecasting · pretrained models · transformer · zero-shot learning · probabilistic forecasting · tokenization · synthetic data · T5

The pith

Pretrained transformers match or beat specialized models on new time series forecasting tasks after tokenizing values into a fixed vocabulary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Chronos shows that time series forecasting can be reframed as a language modeling problem by first scaling and quantizing the numerical values into discrete tokens drawn from a fixed vocabulary. Standard transformer architectures from the T5 family are then pretrained on this tokenized data with the cross-entropy loss, using a large mix of public real-world datasets and additional synthetic series generated from Gaussian processes. The resulting models significantly outperform baselines on data they saw during training, and on entirely new datasets their zero-shot accuracy is comparable to, and occasionally better than, that of methods trained specifically for those datasets. This matters because it offers a route to general-purpose forecasting tools, removing the need to build and retrain a fresh model for every new task or domain.

Core claim

Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. Models based on the T5 family, ranging from 20M to 710M parameters, are pretrained on a large collection of publicly available datasets augmented by Gaussian-process synthetic data. In a benchmark of 42 datasets, these models significantly outperform other methods on training-corpus data and achieve comparable or occasionally superior zero-shot performance on new datasets relative to methods trained specifically on them.

What carries the argument

Tokenization of continuous time series values into a fixed vocabulary through scaling and quantization, which converts the forecasting problem into standard language-model training with cross-entropy loss.
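
To make the mechanism concrete, here is a minimal Python sketch of a scale-then-quantize tokenizer of the kind described; the per-series mean scaling, the bin count, and the bin range are illustrative assumptions, not the paper's exact configuration.

    import numpy as np

    def tokenize(series, n_bins=4094, low=-15.0, high=15.0):
        # Scale by the mean absolute value, then bucket the scaled values
        # into uniformly spaced bins over [low, high]. Illustrative only.
        scale = float(np.mean(np.abs(series))) or 1.0
        edges = np.linspace(low, high, n_bins - 1)
        tokens = np.digitize(series / scale, edges)  # ids in 0 .. n_bins-1
        return tokens, scale

    def detokenize(tokens, scale, n_bins=4094, low=-15.0, high=15.0):
        # Invert up to quantization error: map each token to its bin
        # center and undo the per-series scaling.
        centers = np.linspace(low, high, n_bins)
        return centers[np.clip(tokens, 0, n_bins - 1)] * scale

Once values are tokens, training is ordinary language modeling: the token sequence is fed to a T5-style model and next-token probabilities are fit with cross-entropy.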

Load-bearing premise

Scaling and quantization into a fixed vocabulary must preserve enough information to support accurate probabilistic forecasts, and the added Gaussian-process synthetic data must improve rather than harm generalization to real-world distributions.
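
The abstract says only that the synthetic data is generated via Gaussian processes. A minimal sketch of sampling one such series follows, assuming a zero-mean GP prior whose kernel is the sum of an RBF term (smooth trend) and a periodic term (seasonality); the kernel forms and hyperparameters are illustrative assumptions.

    import numpy as np

    def sample_gp_series(length=256, seed=None):
        # Draw one series from a zero-mean GP prior; the covariance is an
        # RBF kernel plus a periodic kernel, with jitter for stability.
        rng = np.random.default_rng(seed)
        t = np.arange(length, dtype=float)[:, None]
        d = t - t.T  # pairwise time offsets
        rbf = np.exp(-d ** 2 / (2 * 50.0 ** 2))
        periodic = np.exp(-2 * np.sin(np.pi * d / 24.0) ** 2 / 0.5 ** 2)
        cov = rbf + periodic + 1e-6 * np.eye(length)
        return rng.multivariate_normal(np.zeros(length), cov)

Both kernels here are stationary, which is why the premise is load-bearing: transfer to non-stationary real-world series has to be shown empirically, not assumed.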

What would settle it

On a new real-world dataset never seen in pretraining, if a Chronos model used zero-shot produces substantially higher error than a model trained from scratch specifically on that same dataset, the claim of effective zero-shot transfer would be falsified.
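
The test is directly operational: score both models with the same error metric on the same held-out split. A minimal sketch using MASE (mean absolute scaled error); the metric choice is a common forecasting convention, not a prescription from the paper.

    import numpy as np

    def mase(y_true, y_pred, y_train, season=1):
        # Forecast MAE divided by the in-sample MAE of the seasonal-naive
        # forecaster; values above 1 mean worse than seasonal-naive.
        naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
        return np.mean(np.abs(y_true - y_pred)) / naive_mae

    # The claim fails if, on a dataset absent from pretraining,
    # mase(test, zero_shot_pred, train) substantially exceeds
    # mase(test, task_specific_pred, train).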

read the original abstract

We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Chronos, a framework that tokenizes time series values via scaling and quantization into a fixed vocabulary and trains T5-family transformer models (20M–710M parameters) with cross-entropy loss. Models are pretrained on a large collection of public datasets augmented by Gaussian-process synthetic data; a benchmark on 42 datasets shows significant outperformance versus classical and deep-learning baselines on in-distribution tasks and comparable or superior zero-shot performance on unseen datasets.

Significance. If the central performance claims hold after verification of data integrity, the work establishes that language-model pretraining can be effectively transferred to probabilistic time series forecasting, leveraging cross-domain data to simplify pipelines and improve zero-shot accuracy. The scale of the benchmark (42 datasets, multiple model sizes) and explicit comparison to both local and deep baselines are strengths; the approach also supplies a concrete, reproducible tokenization recipe that could serve as a baseline for future pretrained time-series models.

major comments (3)
  1. [§4] Experimental setup (likely §4): the manuscript must explicitly list which of the 42 benchmark datasets appear in the pretraining corpus and confirm zero overlap with the zero-shot test sets. Without this partition table, leakage cannot be ruled out and the zero-shot superiority claim cannot be evaluated.
  2. [§3.1] Tokenization procedure (likely §3.1–3.2): scaling followed by fixed-vocabulary quantization is load-bearing for the probabilistic claim. The paper should report an ablation on vocabulary size (and the per-series scaling rule) showing that discretization does not systematically bias the predictive distribution on high-dynamic-range or heavy-tailed series; otherwise the cross-entropy training may be optimizing a coarsened rather than faithful likelihood (a model-free round-trip check is sketched just after this list).
  3. [§4.3] Synthetic-data ablation (likely §4.3): the Gaussian-process augmentation is presented as improving generalization, yet no controlled comparison (with vs. without GP data) on zero-shot CRPS or coverage is supplied. Because the GP prior is stationary and smooth, its net effect on real-world non-stationary series must be demonstrated rather than assumed.
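
The round-trip check asked for in major comment 2 needs no trained model: it measures the error that tokenize-then-detokenize alone introduces. A minimal sketch, reusing the illustrative tokenize/detokenize pair sketched earlier under "What carries the argument"; the heavy-tailed test input is a hypothetical example.

    import numpy as np

    def roundtrip_rel_error(series, tokenize, detokenize):
        # Relative reconstruction error of quantization alone; large values
        # on heavy-tailed series would mean the discretization itself
        # biases the likelihood before any model is trained.
        tokens, scale = tokenize(series)
        recon = detokenize(tokens, scale)
        return np.abs(recon - series) / (np.abs(series) + 1e-8)

    # Heavy-tailed input where a fixed quantization range can clip:
    heavy = np.random.default_rng(0).standard_t(df=2, size=1000)
    print(roundtrip_rel_error(heavy, tokenize, detokenize).mean())
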
minor comments (3)
  1. Clarify the precise probabilistic metrics (CRPS, quantile loss, etc.) and their normalization; state whether they are computed on the original scale or the quantized tokens (a sketch of one weighted-quantile-loss convention follows this list).
  2. In result tables, report both mean and standard deviation across random seeds or cross-validation folds for the largest Chronos model.
  3. Add a short paragraph comparing the chosen T5 architecture and training recipe to prior time-series transformer work (e.g., Informer, Autoformer) to highlight the novelty of the quantization-plus-LM approach.
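
For minor comment 1, the metric in question can be pinned down in a few lines. A minimal sketch of one common weighted quantile loss convention (pinball loss summed over quantile levels, normalized by the total absolute target, computed on the original rather than the tokenized scale); other normalizations exist, so this is an assumption about convention, not necessarily the paper's definition.

    import numpy as np

    def weighted_quantile_loss(y_true, q_pred, quantiles):
        # q_pred: shape (len(quantiles), len(y_true)), the predicted
        # quantile tracks. Returns a scale-free aggregate score.
        y = np.asarray(y_true, dtype=float)
        total = 0.0
        for q, pred in zip(quantiles, q_pred):
            diff = y - np.asarray(pred, dtype=float)
            total += np.sum(np.maximum(q * diff, (q - 1) * diff))
        return 2 * total / (len(quantiles) * np.sum(np.abs(y)))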

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and insightful review of our paper on Chronos. We have carefully considered each of the major comments and provide point-by-point responses below. We will incorporate the suggested changes into the revised manuscript to address the concerns regarding experimental transparency, tokenization validation, and synthetic data ablation.

read point-by-point responses
  1. Referee: [§4] Experimental setup (likely §4): the manuscript must explicitly list which of the 42 benchmark datasets appear in the pretraining corpus and confirm zero overlap with the zero-shot test sets. Without this partition table, leakage cannot be ruled out and the zero-shot superiority claim cannot be evaluated.

    Authors: We agree that an explicit partition is necessary to substantiate the zero-shot claims and rule out any possibility of leakage. In the revised manuscript, we will add a dedicated table in Section 4 that enumerates all 42 datasets, indicates precisely which ones were included in the pretraining corpus, and confirms that the zero-shot evaluation sets have zero overlap with the pretraining data. This will clearly separate the in-distribution and out-of-distribution results. revision: yes

  2. Referee: [§3.1] Tokenization procedure (likely §3.1–3.2): scaling followed by fixed-vocabulary quantization is load-bearing for the probabilistic claim. The paper should report an ablation on vocabulary size (and the per-series scaling rule) showing that discretization does not systematically bias the predictive distribution on high-dynamic-range or heavy-tailed series; otherwise the cross-entropy training may be optimizing a coarsened rather than faithful likelihood.

    Authors: We recognize that validating the discretization step is important for the probabilistic interpretation. While the current implementation uses a vocabulary size of 4096 with per-series mean scaling, we will add an ablation study in the revised version. This will compare vocabulary sizes (1024, 4096, 16384) and alternative scaling rules, with specific analysis on high-dynamic-range and heavy-tailed series subsets using CRPS and coverage to confirm that the learned distributions remain faithful rather than coarsened. revision: yes

  3. Referee: [§4.3] Synthetic-data ablation (likely §4.3): the Gaussian-process augmentation is presented as improving generalization, yet no controlled comparison (with vs. without GP data) on zero-shot CRPS or coverage is supplied. Because the GP prior is stationary and smooth, its net effect on real-world non-stationary series must be demonstrated rather than assumed.

    Authors: We agree that a controlled ablation is required to quantify the contribution of the GP synthetic data. In the revision, we will include a direct comparison of models trained with and without the GP-augmented data, reporting zero-shot CRPS and coverage metrics on the full benchmark. This will demonstrate the net effect on non-stationary real-world series and address the concern about the stationary GP prior. revision: yes

Circularity Check

0 steps flagged

No significant circularity in Chronos derivation chain

full rationale

The paper's core derivation consists of tokenizing time series via scaling and quantization into a fixed vocabulary, then training T5-based transformers with cross-entropy loss on a mix of public datasets and GP-generated synthetic data. Performance claims rest on empirical benchmarks across 42 datasets, explicitly distinguishing in-corpus results from zero-shot evaluation on held-out datasets never used in pretraining. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing self-citations justify uniqueness or ansatzes, and no renaming of known results occurs. The framework is self-contained through standard pretraining and independent evaluation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that discrete tokenization of scaled values is sufficient for probabilistic forecasting and that synthetic GP data aids transfer without introducing harmful distribution shift.

free parameters (2)
  • quantization vocabulary size
    Number of discrete tokens chosen for the fixed vocabulary; value not stated in abstract.
  • per-series scaling rule
    Method and parameters used to normalize each series before quantization; not detailed in abstract.
axioms (1)
  • domain assumption: Time series values, once scaled and quantized into a fixed vocabulary, retain enough information for accurate probabilistic forecasting.
    This premise enables the language-model training pipeline and is invoked in the tokenization step described in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1311 out tokens · 45903 ms · 2026-05-13T08:21:55.992531+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.

  2. Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    MarsTSC is a VLM-based agentic reasoning framework with a self-evolving knowledge bank and Generator-Reflector-Modifier roles that achieves better few-shot multimodal time series classification than baselines on 12 be...

  3. FactoryBench: Evaluating Industrial Machine Understanding

    cs.AI 2026-05 unverdicted novelty 7.0

    FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.

  4. Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    xMAE pretrains biosignal representations via masked cross-modal reconstruction of temporally ordered signals like ECG and PPG, outperforming baselines on 15 of 19 downstream tasks including cardiovascular prediction a...

  5. Explainable Load Forecasting with Covariate-Informed Time Series Foundation Models

    cs.LG 2026-04 unverdicted novelty 7.0

    Time series foundation models match the performance of specialized models for day-ahead load forecasting while providing explanations that match domain knowledge on weather and calendar effects.

  6. Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring

    cs.LG 2026-04 unverdicted novelty 7.0

    A model-agnostic adaptive conformal anomaly detection approach uses weighted quantile bounds learned from past foundation model predictions to deliver interpretable p-value scores with stable calibration under shifts ...

  7. TempusBench: An Evaluation Framework for Time-Series Forecasting

    cs.LG 2026-04 unverdicted novelty 7.0

    TempusBench is a new evaluation framework for time-series forecasting models that supplies fresh non-overlapping datasets, tasks beyond horizon and domain, consistent tuning across models, and visualization tools.

  8. TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

    cs.AI 2026-04 conditional novelty 7.0

    TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.

  9. LoRM: Learning the Language of Rotating Machinery for Self-Supervised Condition Monitoring

    cs.CL 2026-04 unverdicted novelty 7.0

    LoRM is a self-supervised framework that models multi-modal rotating machinery signals as token sequences for prediction with fine-tuned language models, using prediction errors to monitor machine health in real time.

  10. MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

    cs.LG 2026-05 unverdicted novelty 6.0

    MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.

  11. HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

    cs.LG 2026-05 unverdicted novelty 6.0

    HEPA combines JEPA self-supervised pretraining with horizon-conditioned fine-tuning to predict rare events in multivariate time series as a monotonic survival distribution, outperforming PatchTST, iTransformer, MAE, a...

  12. HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series

    cs.LG 2026-05 unverdicted novelty 6.0

    HEPA combines self-supervised JEPA pretraining on time series representations with horizon-conditioned finetuning to predict rare events via survival CDFs, outperforming PatchTST, iTransformer, MAE, and Chronos-2 on a...

  13. RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction

    cs.LG 2026-05 unverdicted novelty 6.0

    RareCP improves interval efficiency for time series conformal prediction by retrieving and weighting regime-specific calibration examples while adapting to drift and maintaining coverage.

  14. Continuity Laws for Sequential Models

    cs.LG 2026-05 unverdicted novelty 6.0

    S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.

  15. Can Transformers predict system collapse in dynamical systems?

    nlin.CD 2026-05 unverdicted novelty 6.0

    Transformers fail to predict catastrophic collapse in unseen parameter regimes of nonlinear dynamical systems, while reservoir computing reliably succeeds.

  16. FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    Foundation models outperform dataset-specific machine learning in energy time series forecasting across 54 datasets in 9 categories.

  17. Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity

    cs.LG 2026-04 unverdicted novelty 6.0

    Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...

  18. Predicting Power-System Dynamic Trajectories with Foundation Models

    cs.AI 2026-04 unverdicted novelty 6.0

    LASS-ODE-Power is a pretrained model that predicts power-system dynamic trajectories across regimes in a zero-shot manner after large-scale ODE pretraining and targeted fine-tuning.

  19. FM-CAC: Carbon-Aware Control for Battery-Buffered Edge AI via Time-Series Foundation Models

    eess.SY 2026-04 unverdicted novelty 6.0

    FM-CAC uses battery buffering and time-series foundation models for zero-shot carbon forecasting in a dynamic programming optimizer to reduce edge AI carbon emissions by up to 65.6% with near-maximum accuracy.

  20. A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting

    cs.CL 2026-05 unverdicted novelty 5.0

    A hybrid classical-plus-quantum-inspired framework for cross-region renewable energy forecasting matches top baselines within 1% accuracy and separates calm versus stormy conditions with a 15-fold higher Fisher discri...

  21. Sessa: Selective State Space Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  22. Wearable AI in the Era of Large Sensor Models

    eess.SP 2026-04 unverdicted novelty 5.0

    Large Sensor Models trained on large-scale multimodal wearable data can provide a scalable, general framework for wearable AI by learning transferable representations across modalities and tasks.

  23. Thermal-GEMs: Generalized Models for Building Thermal Dynamics

    eess.SY 2026-04 unverdicted novelty 5.0

    Multi-source transfer learning for building thermal dynamics yields up to 63% lower forecasting errors than single-source models and outperforms time series foundation models when pretrained on 16-32 buildings over one year.

  24. Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook

    eess.SP 2026-04 accept novelty 5.0

    The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with...

  25. Empirical Assessment of Time-Series Foundation Models For Power System Forecasting Applications

    eess.SY 2026-04 unverdicted novelty 4.0

    The paper benchmarks foundation models like TimesFM and Chronos against baselines on eight forecasting capabilities for power system time series.

  26. Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction

    cs.LG 2026-05 unverdicted novelty 3.0

    Chronos encodes frequency content in decoder representations with quality that varies across the spectrum, as revealed by minimum description length probes on sinusoid inputs.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 25 Pith papers · 11 internal anchors

  1. [1]

    GluonTS: Probabilistic and Neural Time Series Modeling in Python

    Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, et al. GluonTS: Probabilistic and Neural Time Series Modeling in Python. The Journal of Machine Learning Research, 21(1): 4629-4634, 2020

  2. [2]

    Deep Explicit Duration Switching Models for Time Series

    Abdul Fatir Ansari, Konstantinos Benidis, Richard Kurle, Ali Caner Turkmen, Harold Soh, Alexander J Smola, Bernie Wang, and Tim Januschowski. Deep Explicit Duration Switching Models for Time Series . Advances in Neural Information Processing Systems, 34, 2021

  3. [3]

    Neural continuous-discrete state space models for irregularly-sampled time series

    Abdul Fatir Ansari, Alvin Heng, Andre Lim, and Harold Soh. Neural continuous-discrete state space models for irregularly-sampled time series. In International Conference on Machine Learning, pp. 926-951. PMLR, 2023

  4. [4]

The theta model: a decomposition approach to forecasting

    V. Assimakopoulos and K. Nikolopoulos. The theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4): 521-530, 2000

  5. [5]

The tourism forecasting competition

    George Athanasopoulos, Rob J. Hyndman, Haiyan Song, and Doris C. Wu. The tourism forecasting competition. International Journal of Forecasting, 27(3): 822-844, 2011

  6. [6]

    Deep learning for time series forecasting: Tutorial and literature survey

    Konstantinos Benidis, Syama Sundar Rangapuram, Valentin Flunkert, Yuyang Wang, Danielle Maddix, Caner Turkmen, Jan Gasthaus, Michael Bohlke-Schneider, David Salinas, Lorenzo Stella, François-Xavier Aubet, Laurent Callot, and Tim Januschowski. Deep learning for time series forecasting: Tutorial and literature survey. ACM Comput. Surv., 55(6), 2022

  7. [7]

    Multi-objective model selection for time series forecasting

    Oliver Borchert, David Salinas, Valentin Flunkert, Tim Januschowski, and Stephan Günnemann. Multi-objective model selection for time series forecasting. arXiv preprint arXiv:2202.08485, 2022

  8. [8]

Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  9. [9]

    Neural Contextual Anomaly Detection for Time Series

    Chris U Carmona, François-Xavier Aubet, Valentin Flunkert, and Jan Gasthaus. Neural Contextual Anomaly Detection for Time Series. arXiv:2107.07702, 2021

  10. [10]

    N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting

    Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza Ramirez, Max Mergenthaler Canseco, and Artur Dubrawski. N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting . In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023

  11. [11]

    A neural network approach to ordinal regression

    Jianlin Cheng, Zheng Wang, and Gianluca Pollastri. A neural network approach to ordinal regression. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), pp. 1279-1284. IEEE, 2008

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research, 24(240): 1-113, 2023

  13. [13]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling Instruction-Finetuned Language Models . arXiv:2210.11416, 2022

  14. [14]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning . arXiv:2307.08691, 2023

  15. [15]

TSMix: time series data augmentation by mixing sources

    Luke Nicholas Darlow, Artjom Joosen, Martin Asenov, Qiwen Deng, Jianfeng Wang, and Adam Barker. TSMix: time series data augmentation by mixing sources. In Proceedings of the 3rd Workshop on Machine Learning and Systems, pp. 109-114, 2023

  16. [16]

A decoder-only foundation model for time-series forecasting

    Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. arXiv:2310.10688, 2023

  17. [17]

The UCR Time Series Classification Archive, October 2018

    Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, and Hexagon-ML. The UCR Time Series Classification Archive, October 2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/

  18. [18]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale . arXiv:2208.07339, 2022

  19. [19]

SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling

    Jiaxiang Dong, Haixu Wu, Haoran Zhang, Li Zhang, Jianmin Wang, and Mingsheng Long. SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling. arXiv:2302.00861, 2023

  20. [20]

    ForecastPFN: Synthetically-Trained Zero-Shot Forecasting

    Samuel Dooley, Gurnoor Singh Khurana, Chirag Mohapatra, Siddartha Naidu, and Colin White. ForecastPFN: Synthetically-Trained Zero-Shot Forecasting . In Advances in Neural Information Processing Systems, 2023

  21. [21]

    Structure Discovery in Nonparametric Regression through Compositional Kernel Search

    David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Structure Discovery in Nonparametric Regression through Compositional Kernel Search. In International Conference on Machine Learning, pp. 1166-1174. PMLR, 2013

  22. [22]

    BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting

    Patrick Emami, Abhijeet Sahu, and Peter Graf. BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting . arXiv:2307.00142, 2023

  23. [23]

Hierarchical Neural Story Generation

    Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical Neural Story Generation . arXiv:1805.04833, 2018

  24. [24]

    Stop regressing: Training value functions via classification for scalable deep rl

    Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950, 2024

  25. [25]

    How not to lie with statistics: the correct way to summarize benchmark results

    Philip J Fleming and John J Wallace. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM, 29(3): 218-221, 1986

  26. [26]

    Beam Search Strategies for Neural Machine Translation

    Markus Freitag and Yaser Al-Onaizan. Beam Search Strategies for Neural Machine Translation . arXiv:1702.01806, 2017

  27. [27]

    Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding , November 2023

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Breaking the Sequential Dependency of LLM Inference Using Lookahead Decoding , November 2023. URL https://lmsys.org/blog/2023-11-21-lookahead-decoding/

  28. [28]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling . arXiv:2101.00027, 2020

  29. [29]

StatsForecast: Lightning fast forecasting with statistical and econometric models

    Federico Garza, Max Mergenthaler Canseco, Cristian Challú, and Kin G. Olivares. StatsForecast: Lightning fast forecasting with statistical and econometric models. PyCon Salt Lake City, Utah, US 2022, 2022. URL https://github.com/Nixtla/statsforecast

  30. [30]

    Probabilistic Forecasting with Spline Quantile Function RNNs

    Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Probabilistic Forecasting with Spline Quantile Function RNNs. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. ...

  31. [31]

    Strictly proper scoring rules, prediction, and estimation

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477): 359-378, 2007

  32. [32]

Monash Time Series Forecasting Archive

    Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. Monash Time Series Forecasting Archive. In Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  33. [33]

    Moment: A family of open time-series foundation models

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885, 2024

  34. [34]

    Large Language Models Are Zero-Shot Time Series Forecasters

    Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large Language Models Are Zero-Shot Time Series Forecasters . In Advances in Neural Information Processing Systems, 2023

  35. [35]

    The Curious Case of Neural Text Degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv:1904.09751, 2019

  36. [36]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models . In International Conference on Learning Representations, 2022

  37. [37]

    Transformer-based deep survival analysis

    Shi Hu, Egill Fridgeirsson, Guido van Wingen, and Max Welling. Transformer-based deep survival analysis. In Survival Prediction-Algorithms, Challenges and Applications, pp. 132-148. PMLR, 2021

  38. [38]

    Forecasting with exponential smoothing: the state space approach

    Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008

  39. [39]

    Forecasting: principles and practice

    Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018

  40. [40]

    Another look at measures of forecast accuracy

    Rob J Hyndman and Anne B Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4): 679-688, 2006

  41. [41]

    Deep learning for time series classification: a review

    Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4): 917-963, 2019

  42. [42]

Time-LLM: Time series forecasting by reprogramming large language models

    Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, and Qingsong Wen. Time-LLM: Time series forecasting by reprogramming large language models. In The Twelfth International Conference on Learning Representations, 2024

  43. [43]

    Domain adaptation for time series forecasting via attention sharing

    Xiaoyong Jin, Youngsuk Park, Danielle Maddix, Hao Wang, and Yuyang Wang. Domain adaptation for time series forecasting via attention sharing. In International Conference on Machine Learning, pp. 10280-10297. PMLR, 2022

  44. [44]

    LightGBM: A Highly Efficient Gradient Boosting Decision Tree

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree . Advances in neural information processing systems, 30, 2017

  45. [45]

    Quantile regression

    Roger Koenker and Kevin F Hallock. Quantile regression. Journal of Economic Perspectives, 15(4): 143-156, 2001

  46. [46]

    A classification of business forecasting problems

    Stephan Kolassa and Tim Januschowski. A classification of business forecasting problems. Foresight, 52, 2019

  47. [47]

    Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting

    Marcel Kollovieh, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang, and Yuyang Wang. Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting. In Advances in Neural Information Processing Systems, volume 36, pp. 28341-28364. Curran Associates, Inc., 2023

  48. [48]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274-19286. PMLR, 2023

  49. [49]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension . arXiv:1910.13461, 2019

  50. [50]

    Temporal fusion transformers for interpretable multi-horizon time series forecasting

    Bryan Lim, Sercan Ö. Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4): 1748-1764, 2021

  51. [51]

LargeST: A benchmark dataset for large-scale traffic forecasting

    Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, and Roger Zimmermann. LargeST: A benchmark dataset for large-scale traffic forecasting. arXiv:2306.08259, 2023

  52. [52]

    The M3-Competition: results, conclusions and implications

    Spyros Makridakis and Michele Hibon. The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4): 451-476, 2000

  53. [53]

    Accuracy of forecasting: An empirical investigation

    Spyros Makridakis, Michele Hibon, and Claus Moser. Accuracy of forecasting: An empirical investigation. Journal of the Royal Statistical Society. Series A (General), 142(2): 97-145, 1979

  54. [54]

    The M4 Competition: 100,000 time series and 61 forecasting methods

    Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1): 54-74, 2020

  55. [55]

    M5 accuracy competition: Results, findings, and conclusions

    Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. M5 accuracy competition: Results, findings, and conclusions. International Journal of Forecasting, 38(4): 1346-1364, 2022

  56. [56]

    Regression models for ordinal data

    Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42(2): 109-127, 1980

  57. [57]

    Pointer Sentinel Mixture Models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv:1609.07843, 2016

  58. [58]

    Large language models as general pattern machines

    Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, and Andy Zeng. Large language models as general pattern machines. In Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pp. 2498-2518. PMLR, 2023

  59. [59]

A time series is worth 64 words: Long-term forecasting with transformers

    Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023

  60. [60]

NeuralForecast: User friendly state-of-the-art neural forecasting models

    Kin G. Olivares, Cristian Challú, Federico Garza, Max Mergenthaler Canseco, and Artur Dubrawski. NeuralForecast: User friendly state-of-the-art neural forecasting models. PyCon Salt Lake City, Utah, US 2022, 2022. URL https://github.com/Nixtla/neuralforecast

  61. [61]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016

  62. [62]

N-BEATS: Neural basis expansion analysis for interpretable time series forecasting

    Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations, 2020

  63. [63]

Meta-learning framework with applications to zero-shot time-series forecasting

    Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. Meta-learning framework with applications to zero-shot time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021

  64. [64]

Zero-shot and few-shot time series forecasting with ordinal regression recurrent neural networks

    Bernardo Pérez Orozco and Stephen J. Roberts. Zero-shot and few-shot time series forecasting with ordinal regression recurrent neural networks. In 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 503-508, 2020

  65. [65]

    Learning quantile functions without quantile crossing for distribution-free time series forecasting

    Youngsuk Park, Danielle Maddix, François-Xavier Aubet, Kelvin Kan, Jan Gasthaus, and Yuyang Wang. Learning quantile functions without quantile crossing for distribution-free time series forecasting. In International Conference on Artificial Intelligence and Statistics, pp. 8127-8150. PMLR, 2022

  66. [66]

    A simple combination of univariate models

    Fotios Petropoulos and Ivan Svetunkov. A simple combination of univariate models. International Journal of Forecasting, 36(1): 110-115, 2020

  67. [67]

    The effectiveness of discretization in forecasting: An empirical study on neural time series models

    Stephan Rabanser, Tim Januschowski, Valentin Flunkert, David Salinas, and Jan Gasthaus. The effectiveness of discretization in forecasting: An empirical study on neural time series models. arXiv:2005.10111, 2020

  68. [68]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  69. [69]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1): 5485-5551, 2020

  70. [70]

    Integrating multimodal information in large pretrained transformers

    Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. Integrating multimodal information in large pretrained transformers. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2020, pp. 2359. NIH Public Access, 2020

  71. [71]

    Deep state space models for time series forecasting

    Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31, 2018

  72. [72]

    Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting

    Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Machine Learning, pp. 8857-8868. PMLR, 2021

  73. [73]

    Lag-llama: Towards foundation models for time series forecasting, 2023

    Kashif Rasul, Arjun Ashok, Andrew Robert Williams, Arian Khorasani, George Adamopoulos, Rishika Bhagwatkar, Marin Biloš, Hena Ghonia, Nadhir Vincent Hassen, Anderson Schneider, Sahil Garg, Alexandre Drouin, Nicolas Chapados, Yuriy Nevmyvaka, and Irina Rish. Lag-llama: Towards foundation models for time series forecasting, 2023

  74. [74]

    Conformalized quantile regression

    Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in neural information processing systems, 32, 2019

  75. [75]

    Latent ordinary differential equations for irregularly-sampled time series

    Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. Advances in neural information processing systems, 32, 2019

  76. [76]

DeepAR: Probabilistic forecasting with autoregressive recurrent networks

    David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3): 1181-1191, 2020

  77. [77]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv:1508.07909, 2015

  78. [78]

AutoGluon-TimeSeries: AutoML for probabilistic time series forecasting

    Oleksandr Shchur, Ali Caner Turkmen, Nick Erickson, Huibin Shen, Alexander Shirkov, Tony Hu, and Bernie Wang. AutoGluon-TimeSeries: AutoML for probabilistic time series forecasting. In International Conference on Automated Machine Learning, pp. 9-1. PMLR, 2023

  79. [79]

    Conformal time-series forecasting

    Kamile Stankeviciute, Ahmed M Alaa, and Mihaela van der Schaar. Conformal time-series forecasting. Advances in Neural Information Processing Systems, 34: 6216-6228, 2021

  80. [80]

    Regression as classification: Influence of task formulation on neural network features

    Lawrence Stewart, Francis Bach, Quentin Berthet, and Jean-Philippe Vert. Regression as classification: Influence of task formulation on neural network features. In International Conference on Artificial Intelligence and Statistics, pp. 11563-11582. PMLR, 2023

Showing first 80 references.