Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kan Ren; Kun Feng; Lintao Ma; Shaocheng Lan; Shuqi Gu; Sihan Lu; Wenchao He; Xingyu Lu; Yuchen Fang

arxiv: 2509.25826 · v3 · pith:7LDJ6CTFnew · submitted 2025-09-30 · 💻 cs.LG

Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kun Feng , Shaocheng Lan , Yuchen Fang , Wenchao He , Sihan Lu , Shuqi Gu , Lintao Ma , Xingyu Lu

show 1 more author

Kan Ren

This is my paper

Pith reviewed 2026-05-18 13:16 UTC · model grok-4.3

classification 💻 cs.LG

keywords time series foundation modelszero-shot forecastingdynamic tokenizationparameter efficiencytemporal heterogeneitypositional embeddingforecasting benchmarks

0 comments

The pith

Kairos decouples temporal heterogeneity from model capacity using dynamic patching tokenizer and mixture-of-size encoding for time series forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the problem of temporal heterogeneity, such as varying sampling densities and periodic structures, that hinders zero-shot generalization in Time Series Foundation Models. Existing approaches absorb this heterogeneity through massive parameterization and static tokenization schemes that encourage memorization. Kairos instead introduces a dynamic patching tokenizer and mixture-of-size encoding to adapt observational granularity to local information density, plus multi-granularity positional embeddings based on dynamic rotary encodings conditioned on spectral features. These components are designed to work without increasing model width or depth. When trained on a Predictability-Stratified Time-Series corpus, the resulting model reports stronger zero-shot results on GIFT-Eval and Time-Series-Library with substantially fewer parameters.

Core claim

Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks.

What carries the argument

Dynamic patching tokenizer paired with mixture-of-size encoding and multi-granularity positional embedding via dynamic rotary encodings.

Load-bearing premise

Decoupling temporal heterogeneity through dynamic patching tokenizer and mixture-of-size encoding can be done without increasing model width or depth while still preserving the information needed for accurate forecasting.

What would settle it

A baseline model using only static tokenization and positional encoding, trained and evaluated at the same parameter count, matching or exceeding Kairos zero-shot accuracy on both GIFT-Eval and Time-Series-Library would falsify the necessity of the dynamic components.

Figures

Figures reproduced from arXiv: 2509.25826 by Kan Ren, Kun Feng, Lintao Ma, Shaocheng Lan, Shuqi Gu, Sihan Lu, Wenchao He, Xingyu Lu, Yuchen Fang.

**Figure 1.** Figure 1: (a) The trade-off between performance (normalized MASE) and the number of parameters on GIFT-Eval benchmark (Aksu et al., 2024) for existing TSFMs. Our KAIROS achieves a superior performance at a comparable parameter scale. (b) (c) Significant variation exists in information density across and within different time series datasets. (d) Existing TSFMs primarily use tokenization methods like point-wise or fi… view at source ↗

**Figure 2.** Figure 2: The architecture of KAIROS, which including (1) Mixture-of-Size Dynamic Patching [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Zero-shot forecasting performance on TSLib. Results are averaged across prediction lengths [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Patch size preferences in GIFT-Eval test datasets. Darker shades indicate a smaller weighted [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Causal analysis of adaptive modulation by IARoPE. This experiment validates the criticality of matching positional encodings to the unique characteristics of each time series instance by disrupting or removing this adaptation. We test this by manipulating the RoPE frequencies θ under several conditions: IARoPE (standard), Intra-Dataset Shuffle (θ modulations permuted between different instances within th… view at source ↗

**Figure 6.** Figure 6: Performance analysis of multi-patch prediction on the GIFT-Eval benchmark across [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Example of forecasts from KAIROSb on the test datasets used in experiments. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗

read the original abstract

Inherent temporal heterogeneity, such as varying sampling densities and periodic structures, has posed substantial challenges in zero-shot generalization for Time Series Foundation Models (TSFMs). Existing TSFMs predominantly rely on massive parameterization to absorb such heterogeneity, as their static tokenization and positional encoding schemes entangle diverse temporal patterns into a fixed representation space, encouraging memorization rather than adaptation. To address this limitation, we propose Kairos, a flexible and parameter-efficient TSFM dedicated to forecasting tasks, which decouples temporal heterogeneity from model capacity through a novel tokenization perspective. Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library. The project page is at https://foundation-model-research.github.io/Kairos .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Kairos targets temporal heterogeneity in TSFMs via dynamic patching and mixture-of-size encoding to keep models smaller, but the parameter savings and zero-shot gains need direct verification from the results.

read the letter

The main thing to know is that Kairos tries to decouple time series heterogeneity from model capacity by adapting patch sizes and encodings per instance rather than scaling up width or depth. It introduces a dynamic patching tokenizer, mixture-of-size encoding, and spectral-conditioned dynamic rotary positional embeddings, then trains on a new PreSTS corpus for zero-shot forecasting on GIFT-Eval and Time-Series-Library benchmarks.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Kairos, a parameter-efficient time series foundation model for zero-shot forecasting. It proposes a dynamic patching tokenizer and mixture-of-size encoding to adapt observational granularity to local information density without increasing model width or depth, along with multi-granularity positional embeddings conditioned on instance-level spectral features and temporal structure. The model is trained on a new Predictability-Stratified Time-Series (PreSTS) corpus and claims superior zero-shot performance on GIFT-Eval and Time-Series-Library benchmarks with substantially fewer parameters than prior TSFMs.

Significance. If the central claims on parameter efficiency and performance hold under rigorous verification, this work would offer a meaningful advance by shifting focus from massive parameterization to adaptive tokenization for handling temporal heterogeneity in TSFMs. The architectural innovations and PreSTS corpus could influence future designs of efficient foundation models, provided the decoupling of heterogeneity is shown to preserve forecasting information without hidden capacity costs.

major comments (3)

[§3.2] §3.2: The dynamic patching tokenizer and mixture-of-size encoding are asserted to decouple temporal heterogeneity 'without increasing model width or depth'. However, the description does not specify the implementation of patch-size selection or mixture weights (e.g., whether a learned router MLP or gating parameters are used). This detail is load-bearing for the 'substantially fewer parameters' claim and must be clarified with explicit parameter counts.
[§5.1] §5.1, Table 2: Zero-shot results on GIFT-Eval and Time-Series-Library are presented as superior, yet no ablation studies isolate the contribution of dynamic patching versus mixture-of-size encoding versus the spectral-conditioned rotary embeddings. Without these, it remains unclear whether performance gains stem from the proposed mechanisms or from the PreSTS corpus alone.
[§4.3] §4.3: The multi-granularity positional embedding conditions on spectral features induced by dynamic patching. The paper should provide a parameter breakdown (or equation) showing that the additional conditioning projections add negligible or zero trainable parameters relative to standard rotary embeddings; otherwise the efficiency advantage over baselines is not established.

minor comments (2)

[Figure 1] Figure 1: The architecture diagram would benefit from explicit annotation of which components are parameter-free versus those that introduce new weights.
The abstract states performance claims but the main text should include a dedicated early section with exact parameter counts and baseline comparisons for immediate verification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have helped us identify areas where additional clarification and analysis will strengthen the presentation of our contributions. We address each major comment point by point below, providing explanations and indicating the revisions we will make in the next version of the paper.

read point-by-point responses

Referee: [§3.2] §3.2: The dynamic patching tokenizer and mixture-of-size encoding are asserted to decouple temporal heterogeneity 'without increasing model width or depth'. However, the description does not specify the implementation of patch-size selection or mixture weights (e.g., whether a learned router MLP or gating parameters are used). This detail is load-bearing for the 'substantially fewer parameters' claim and must be clarified with explicit parameter counts.

Authors: We appreciate the referee identifying this point of potential ambiguity. The patch-size selection in the dynamic patching tokenizer is performed by a deterministic, non-learned heuristic that computes local information density from the standard deviation of first-order differences over candidate windows and maps it to one of a fixed discrete set of patch sizes. The mixture-of-size encoding then uses normalized weights derived directly from the chosen patch sizes (proportional to their relative coverage), with no trainable router, MLP, or gating parameters involved. In the revised manuscript, we have expanded Section 3.2 with a precise algorithmic description, pseudocode, and a dedicated parameter-count table that explicitly shows these components contribute zero additional trainable parameters relative to a conventional static patching tokenizer. This addition directly bolsters the parameter-efficiency claims. revision: yes
Referee: [§5.1] §5.1, Table 2: Zero-shot results on GIFT-Eval and Time-Series-Library are presented as superior, yet no ablation studies isolate the contribution of dynamic patching versus mixture-of-size encoding versus the spectral-conditioned rotary embeddings. Without these, it remains unclear whether performance gains stem from the proposed mechanisms or from the PreSTS corpus alone.

Authors: We agree that component-wise ablations would make the source of the observed gains more transparent. Although the primary experiments already compare Kairos against other models trained on the identical PreSTS corpus, we did not include isolated ablations in the original submission. In the revised version, we have added a new set of controlled ablation experiments in Section 5.1. These train and evaluate four variants under identical optimization and data conditions: (i) static patching with standard positional embeddings, (ii) dynamic patching alone, (iii) dynamic patching plus mixture-of-size encoding, and (iv) the complete model including spectral-conditioned embeddings. The new results (reported in an additional table) demonstrate incremental improvements attributable to each architectural element beyond the corpus itself. We have also updated the surrounding text to highlight this controlled comparison. revision: yes
Referee: [§4.3] §4.3: The multi-granularity positional embedding conditions on spectral features induced by dynamic patching. The paper should provide a parameter breakdown (or equation) showing that the additional conditioning projections add negligible or zero trainable parameters relative to standard rotary embeddings; otherwise the efficiency advantage over baselines is not established.

Authors: Thank you for this request for explicit verification. The conditioning is realized by modulating the base rotary frequencies with a lightweight function of the instance-level spectral features; the required linear projection reuses the existing model embedding matrix and adds only a small bias vector whose size equals the hidden dimension. In the revised Section 4.3 we now include the precise mathematical formulation (updated Equation 4) together with a parameter-breakdown table that compares the total trainable parameters of the conditioned embedding against standard RoPE. The table confirms that the net increase is negligible (well under 0.1 % of total model parameters) and does not erode the efficiency advantage relative to prior TSFMs. revision: yes

Circularity Check

0 steps flagged

No circularity detected in architectural proposal or performance claims

full rationale

The paper proposes new mechanisms (dynamic patching tokenizer, mixture-of-size encoding, spectral-conditioned rotary embeddings) to decouple temporal heterogeneity without increasing model width or depth. These are presented as novel designs trained on the PreSTS corpus and evaluated zero-shot on external benchmarks GIFT-Eval and Time-Series-Library. No equations, parameters, or results are shown to reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on independent architectural choices and empirical results rather than renaming or importing uniqueness from prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 4 invented entities

The central claim rests on several newly introduced components and a custom training corpus whose effectiveness is asserted but not independently validated in the provided abstract.

axioms (1)

standard math Standard transformer attention and embedding mechanisms can be adapted for time series via patching and positional encodings.
The model builds on transformer foundations for sequence modeling.

invented entities (4)

Dynamic patching tokenizer no independent evidence
purpose: Adapt observational granularity to local information density
New tokenization scheme to handle varying sampling densities and periodic structures.
Mixture-of-size encoding no independent evidence
purpose: Enable fine-grained temporal abstraction without increasing model capacity
Novel encoding to support adaptive patch sizes.
Multi-granularity positional embedding no independent evidence
purpose: Condition on instance-level spectral features and temporal structure for robust dependency modeling
Dynamic rotary encodings tied to the patching output.
PreSTS corpus no independent evidence
purpose: Provide predictability-stratified training data for the model
Novel dataset introduced for training.

pith-pipeline@v0.9.0 · 5774 in / 1398 out tokens · 47992 ms · 2026-05-18T13:16:01.378980+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Mixture-of-Size Dynamic Patching (MoS-DP) ... Dynamic Patch Router ... finest patch size pk = min{pi | gn,i > 0}
IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Instance-Adaptive Rotary Position Embedding (IARoPE) ... θ′j = γj ⊙ θinit,j + βj ... FFT low-frequency components

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TempusBench: An Evaluation Framework for Time-Series Forecasting
cs.LG 2026-04 unverdicted novelty 7.0

TempusBench is a new evaluation framework for time-series forecasting models that supplies fresh non-overlapping datasets, tasks beyond horizon and domain, consistent tuning across models, and visualization tools.
WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

WaveMoE uses a dual-path architecture with aligned time-series and wavelet tokens routed through shared experts to improve forecasting performance on diverse benchmarks.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning,

ISSN 2835-8856. URL https://openreview.net/forum?id=gerNCVqqtR. Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. arXiv preprint arXiv:2505.23719,

work page arXiv
[4]

Scientific reports 12, 16327

Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting.arXiv preprint arXiv:2402.05956,

work page arXiv
[5]

This time is different: An observability perspective on time series foundation models, 2025

Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, et al. This time is different: An observability perspective on time series foundation models.arXiv preprint arXiv:2505.14766,

work page arXiv
[6]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything.arXiv:2304.02643,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024a. Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Moirai-moe: Empo...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Timer-xl: Long-context transformers for unified time series forecasting

Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer-xl: Long-context transformers for unified time series forecasting.arXiv preprint arXiv:2410.04803, 2024c. Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Sundial: A family of highly capable time series foundation models.arX...

work page arXiv
[9]

Neural machine translation of rare words with subword units

Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URLhttps://aclanthology.org/P16-1162/. 11 Preprint. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations,

work page doi:10.18653/v1/p16-1162
[10]

Time-moe: Billion-scale time series foundation models with mixture of experts

URL https: //arxiv.org/abs/2409.16040. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page arXiv
[11]

Timemixer: Decomposable mul- tiscale mixing for time series forecasting.arXiv preprint arXiv:2405.14616,

Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and Jun Zhou. Timemixer: Decomposable multiscale mixing for time series forecasting.arXiv preprint arXiv:2405.14616, 2024a. Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, and Jingren Zhou. Output scaling: Yinglong- delayed chain of thought in a large pretrained time series...

work page arXiv
[12]

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis.arXiv preprint arXiv:2210.02186,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Zhang, and Qiang Xu

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting?arXiv preprint arXiv:2205.13504,

work page arXiv
[14]

Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, and Zhijie Deng. Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 6223–6235,

work page 2024
[15]

Elastst: To- wards robust varied-horizon forecasting with elastic time-series transformer.arXiv preprint arXiv:2411.01842,

Jiawen Zhang, Shun Zheng, Xumeng Wen, Xiaofang Zhou, Jiang Bian, and Jia Li. Elastst: To- wards robust varied-horizon forecasting with elastic time-series transformer.arXiv preprint arXiv:2411.01842,

work page arXiv
[16]

Informer: Beyond efficient transformer for long sequence time-series forecasting

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InThe Thirty- Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pp. 11106–11115. AAAI Press, 2021a. Haoyi Zhou, Shanghang Zhang, Jieqi Pen...

work page 2021
[17]

However, this method necessitates multiple iterations of autoregressive prediction, leading to a significant degradation in performance for medium- and long-term forecasting

Our observations reveal that whenJ= 1 , which corresponds to the conventional approach of predicting a single patch (Ansari et al., 2024; Liu et al., 2024c; Das et al., 2024), the model achieves optimal performance in short-term forecasting. However, this method necessitates multiple iterations of autoregressive prediction, leading to a significant degrad...

work page 2024
[18]

The learning rate for parameters related to IARoPE is set to 1e-5, while the learning rate for others is set to 1e-3

We employ the AdamW optimizer, a linear decay learning rate adjustment strategy for model optimization. The learning rate for parameters related to IARoPE is set to 1e-5, while the learning rate for others is set to 1e-3. Training is conducted on 4 × NVIDIA A100 GPUs using TF32 precision, which takes only 15 hours for base size. Table 4: Details of KAIROS...

work page 2048
[19]

Following (Das et al., 2024), the training loader samples 80% real data and 20% synthetic data

in conjunction with 15B synthetic time points. Following (Das et al., 2024), the training loader samples 80% real data and 20% synthetic data. Real-world data.The real-world datasets were stratified into five tiers based on their predictability. This hierarchical structure dictates the sampling probability during model training, assigning a higher likelih...

work page 2024
[20]

Table 5: Detailed descriptions of second-level, minute-level, and hourly datasets

19 Preprint. Table 5: Detailed descriptions of second-level, minute-level, and hourly datasets. Dataset Domain Frequency # Time Series # Time points Wind Power Energy 4S 1 7,397,147 Residential Load Power Energy T 813 437,983,677 Residential PV Power Energy T 699 376,016,850 Loop Seattle Transport 5T 323 33,953,760 Los-Loop Transport 5T 207 7,094,304 PEMS...

work page 2018
[21]

For the evaluation on the TSLib benchmark, the datasets were Table 9: Detailed descriptions of evaluation datasets. Dataset Domain Frequency # Time Series # Target # Time points ETTh1 Energy H 1 7 17,420 ETTh2 Energy H 1 7 17,420 ETTm1 Energy 15T 1 7 69,680 ETTm2 Energy 15T 1 7 69,680 Weather Nature 10T 1 21 52,696 Saugeen (D) Nature D 1 1 23,741 Saugeen ...

work page 2023
[22]

Consequently, we evaluate KAIROSand other TSFMs under a long -context setting

have devoted attention to predicting over long contexts. Consequently, we evaluate KAIROSand other TSFMs under a long -context setting. Specifically, we adopt a context length of 2048 time steps and examine four prediction horizons, which are {96,192,336,720}. For TSFMs incapable of processing this context length, we instead employ the context length at w...

work page 2048
[23]

Method DLinear iTrans

Table 10: Context Lengths for Models on the TSLib Benchmark. Method DLinear iTrans. TimesNet PatchTST Path. Chronos Moirai TimesFM-2.0 Timer-XL TTMa ChronosBolt KAIROS Context length {96, 2048} {96, 2048} {96, 2048} {336, 512, 2048} {96, 2048} 512 2048 2048 2048 1536 2048 2048 E.4 DETAILS OFIAROPE ANALYSIS In this section, we explain in more detail the se...

work page 2048
[24]

While KAIROSitself is not designed for direct societal applications with immediate negative impacts, any powerful predictive technology could be misused

H BROADERIMPACTS Our work on KAIROSis foundational research focused on advancing time series modeling. While KAIROSitself is not designed for direct societal applications with immediate negative impacts, any powerful predictive technology could be misused. So beyond general risks of advanced AI, our model has no specific negative societal impacts need to ...

work page arXiv 1930

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning,

ISSN 2835-8856. URL https://openreview.net/forum?id=gerNCVqqtR. Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. arXiv preprint arXiv:2505.23719,

work page arXiv

[4] [4]

Scientific reports 12, 16327

Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, and Chenjuan Guo. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting.arXiv preprint arXiv:2402.05956,

work page arXiv

[5] [5]

This time is different: An observability perspective on time series foundation models, 2025

Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, et al. This time is different: An observability perspective on time series foundation models.arXiv preprint arXiv:2505.14766,

work page arXiv

[6] [6]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything.arXiv:2304.02643,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024a. Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Moirai-moe: Empo...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Timer-xl: Long-context transformers for unified time series forecasting

Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Timer-xl: Long-context transformers for unified time series forecasting.arXiv preprint arXiv:2410.04803, 2024c. Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, and Mingsheng Long. Sundial: A family of highly capable time series foundation models.arX...

work page arXiv

[9] [9]

Neural machine translation of rare words with subword units

Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URLhttps://aclanthology.org/P16-1162/. 11 Preprint. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations,

work page doi:10.18653/v1/p16-1162

[10] [10]

Time-moe: Billion-scale time series foundation models with mixture of experts

URL https: //arxiv.org/abs/2409.16040. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063,

work page arXiv

[11] [11]

Timemixer: Decomposable mul- tiscale mixing for time series forecasting.arXiv preprint arXiv:2405.14616,

Shiyu Wang, Haixu Wu, Xiaoming Shi, Tengge Hu, Huakun Luo, Lintao Ma, James Y Zhang, and Jun Zhou. Timemixer: Decomposable multiscale mixing for time series forecasting.arXiv preprint arXiv:2405.14616, 2024a. Xue Wang, Tian Zhou, Jinyang Gao, Bolin Ding, and Jingren Zhou. Output scaling: Yinglong- delayed chain of thought in a large pretrained time series...

work page arXiv

[12] [12]

TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis.arXiv preprint arXiv:2210.02186,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Zhang, and Qiang Xu

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting?arXiv preprint arXiv:2205.13504,

work page arXiv

[14] [14]

Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models

Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, and Zhijie Deng. Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models. InFindings of the Association for Computational Linguistics: EMNLP 2024, pp. 6223–6235,

work page 2024

[15] [15]

Elastst: To- wards robust varied-horizon forecasting with elastic time-series transformer.arXiv preprint arXiv:2411.01842,

Jiawen Zhang, Shun Zheng, Xumeng Wen, Xiaofang Zhou, Jiang Bian, and Jia Li. Elastst: To- wards robust varied-horizon forecasting with elastic time-series transformer.arXiv preprint arXiv:2411.01842,

work page arXiv

[16] [16]

Informer: Beyond efficient transformer for long sequence time-series forecasting

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InThe Thirty- Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, volume 35, pp. 11106–11115. AAAI Press, 2021a. Haoyi Zhou, Shanghang Zhang, Jieqi Pen...

work page 2021

[17] [17]

However, this method necessitates multiple iterations of autoregressive prediction, leading to a significant degradation in performance for medium- and long-term forecasting

Our observations reveal that whenJ= 1 , which corresponds to the conventional approach of predicting a single patch (Ansari et al., 2024; Liu et al., 2024c; Das et al., 2024), the model achieves optimal performance in short-term forecasting. However, this method necessitates multiple iterations of autoregressive prediction, leading to a significant degrad...

work page 2024

[18] [18]

The learning rate for parameters related to IARoPE is set to 1e-5, while the learning rate for others is set to 1e-3

We employ the AdamW optimizer, a linear decay learning rate adjustment strategy for model optimization. The learning rate for parameters related to IARoPE is set to 1e-5, while the learning rate for others is set to 1e-3. Training is conducted on 4 × NVIDIA A100 GPUs using TF32 precision, which takes only 15 hours for base size. Table 4: Details of KAIROS...

work page 2048

[19] [19]

Following (Das et al., 2024), the training loader samples 80% real data and 20% synthetic data

in conjunction with 15B synthetic time points. Following (Das et al., 2024), the training loader samples 80% real data and 20% synthetic data. Real-world data.The real-world datasets were stratified into five tiers based on their predictability. This hierarchical structure dictates the sampling probability during model training, assigning a higher likelih...

work page 2024

[20] [20]

Table 5: Detailed descriptions of second-level, minute-level, and hourly datasets

19 Preprint. Table 5: Detailed descriptions of second-level, minute-level, and hourly datasets. Dataset Domain Frequency # Time Series # Time points Wind Power Energy 4S 1 7,397,147 Residential Load Power Energy T 813 437,983,677 Residential PV Power Energy T 699 376,016,850 Loop Seattle Transport 5T 323 33,953,760 Los-Loop Transport 5T 207 7,094,304 PEMS...

work page 2018

[21] [21]

For the evaluation on the TSLib benchmark, the datasets were Table 9: Detailed descriptions of evaluation datasets. Dataset Domain Frequency # Time Series # Target # Time points ETTh1 Energy H 1 7 17,420 ETTh2 Energy H 1 7 17,420 ETTm1 Energy 15T 1 7 69,680 ETTm2 Energy 15T 1 7 69,680 Weather Nature 10T 1 21 52,696 Saugeen (D) Nature D 1 1 23,741 Saugeen ...

work page 2023

[22] [22]

Consequently, we evaluate KAIROSand other TSFMs under a long -context setting

have devoted attention to predicting over long contexts. Consequently, we evaluate KAIROSand other TSFMs under a long -context setting. Specifically, we adopt a context length of 2048 time steps and examine four prediction horizons, which are {96,192,336,720}. For TSFMs incapable of processing this context length, we instead employ the context length at w...

work page 2048

[23] [23]

Method DLinear iTrans

Table 10: Context Lengths for Models on the TSLib Benchmark. Method DLinear iTrans. TimesNet PatchTST Path. Chronos Moirai TimesFM-2.0 Timer-XL TTMa ChronosBolt KAIROS Context length {96, 2048} {96, 2048} {96, 2048} {336, 512, 2048} {96, 2048} 512 2048 2048 2048 1536 2048 2048 E.4 DETAILS OFIAROPE ANALYSIS In this section, we explain in more detail the se...

work page 2048

[24] [24]

While KAIROSitself is not designed for direct societal applications with immediate negative impacts, any powerful predictive technology could be misused

H BROADERIMPACTS Our work on KAIROSis foundational research focused on advancing time series modeling. While KAIROSitself is not designed for direct societal applications with immediate negative impacts, any powerful predictive technology could be misused. So beyond general risks of advanced AI, our model has no specific negative societal impacts need to ...

work page arXiv 1930