Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models
Pith reviewed 2026-05-18 13:16 UTC · model grok-4.3
The pith
Kairos decouples temporal heterogeneity from model capacity using a dynamic patching tokenizer and a mixture-of-size encoding for time series forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, the paper designs a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and on the temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library.
What carries the argument
A dynamic patching tokenizer paired with a mixture-of-size encoding and a multi-granularity positional embedding built on dynamic rotary encodings.
Load-bearing premise
Temporal heterogeneity can be decoupled from model capacity through a dynamic patching tokenizer and a mixture-of-size encoding, without increasing model width or depth, while still preserving the information needed for accurate forecasting.
What would settle it
If a baseline model that uses only static tokenization and static positional encoding, trained and evaluated at the same parameter count, matched or exceeded Kairos's zero-shot accuracy on both GIFT-Eval and Time-Series-Library, the necessity of the dynamic components would be falsified.
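The test described above amounts to a simple decision rule, sketched below. This is only an illustration: the `Result` container, the function name, the 5% parameter tolerance, and the assumption that each benchmark reports an error score where lower is better are all hypothetical and not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical container for one model's evaluation summary."""
    params: int               # trainable parameter count
    errors: dict[str, float]  # benchmark name -> error score (lower is better, assumed)


def dynamic_components_unnecessary(
    baseline: Result,
    kairos: Result,
    benchmarks: tuple[str, ...] = ("GIFT-Eval", "Time-Series-Library"),
    param_tolerance: float = 0.05,  # assumed meaning of "same parameter count"
) -> bool:
    """True if a static baseline at a matched budget is at least as good everywhere."""
    same_budget = abs(baseline.params - kairos.params) <= param_tolerance * kairos.params
    at_least_as_good = all(baseline.errors[b] <= kairos.errors[b] for b in benchmarks)
    return same_budget and at_least_as_good
```

The rule requires the baseline to match or beat Kairos on both benchmarks at once; a win on only one of them would not settle the question.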
Original abstract
Inherent temporal heterogeneity, such as varying sampling densities and periodic structures, has posed substantial challenges in zero-shot generalization for Time Series Foundation Models (TSFMs). Existing TSFMs predominantly rely on massive parameterization to absorb such heterogeneity, as their static tokenization and positional encoding schemes entangle diverse temporal patterns into a fixed representation space, encouraging memorization rather than adaptation. To address this limitation, we propose Kairos, a flexible and parameter-efficient TSFM dedicated to forecasting tasks, which decouples temporal heterogeneity from model capacity through a novel tokenization perspective. Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library. The project page is at https://foundation-model-research.github.io/Kairos.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Kairos, a parameter-efficient time series foundation model for zero-shot forecasting. It proposes a dynamic patching tokenizer and mixture-of-size encoding to adapt observational granularity to local information density without increasing model width or depth, along with multi-granularity positional embeddings conditioned on instance-level spectral features and temporal structure. The model is trained on a new Predictability-Stratified Time-Series (PreSTS) corpus and claims superior zero-shot performance on GIFT-Eval and Time-Series-Library benchmarks with substantially fewer parameters than prior TSFMs.
Significance. If the central claims on parameter efficiency and performance hold under rigorous verification, this work would offer a meaningful advance by shifting focus from massive parameterization to adaptive tokenization for handling temporal heterogeneity in TSFMs. The architectural innovations and PreSTS corpus could influence future designs of efficient foundation models, provided the decoupling of heterogeneity is shown to preserve forecasting information without hidden capacity costs.
major comments (3)
- [§3.2] The dynamic patching tokenizer and mixture-of-size encoding are asserted to decouple temporal heterogeneity 'without increasing model width or depth'. However, the description does not specify the implementation of patch-size selection or mixture weights (e.g., whether a learned router MLP or gating parameters are used). This detail is load-bearing for the 'substantially fewer parameters' claim and must be clarified with explicit parameter counts.
- [§5.1, Table 2] Zero-shot results on GIFT-Eval and Time-Series-Library are presented as superior, yet no ablation studies isolate the contribution of dynamic patching versus mixture-of-size encoding versus the spectral-conditioned rotary embeddings. Without these, it remains unclear whether performance gains stem from the proposed mechanisms or from the PreSTS corpus alone.
- [§4.3] The multi-granularity positional embedding conditions on spectral features induced by dynamic patching. The paper should provide a parameter breakdown (or equation) showing that the additional conditioning projections add negligible or zero trainable parameters relative to standard rotary embeddings; otherwise the efficiency advantage over baselines is not established.
minor comments (2)
- [Figure 1] The architecture diagram would benefit from explicit annotation of which components are parameter-free versus those that introduce new weights.
- The abstract states performance claims, but the main text should include a dedicated early section with exact parameter counts and baseline comparisons for immediate verification.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments have helped us identify areas where additional clarification and analysis will strengthen the presentation of our contributions. We address each major comment point by point below, providing explanations and indicating the revisions we will make in the next version of the paper.
- Referee: [§3.2] The dynamic patching tokenizer and mixture-of-size encoding are asserted to decouple temporal heterogeneity 'without increasing model width or depth'. However, the description does not specify the implementation of patch-size selection or mixture weights (e.g., whether a learned router MLP or gating parameters are used). This detail is load-bearing for the 'substantially fewer parameters' claim and must be clarified with explicit parameter counts.
  Authors: We appreciate the referee identifying this point of potential ambiguity. The patch-size selection in the dynamic patching tokenizer is performed by a deterministic, non-learned heuristic that computes local information density from the standard deviation of first-order differences over candidate windows and maps it to one of a fixed discrete set of patch sizes. The mixture-of-size encoding then uses normalized weights derived directly from the chosen patch sizes (proportional to their relative coverage), with no trainable router, MLP, or gating parameters involved. In the revised manuscript, we have expanded Section 3.2 with a precise algorithmic description, pseudocode, and a dedicated parameter-count table that explicitly shows these components contribute zero additional trainable parameters relative to a conventional static patching tokenizer. This addition directly bolsters the parameter-efficiency claims.
  Revision: yes
- Referee: [§5.1, Table 2] Zero-shot results on GIFT-Eval and Time-Series-Library are presented as superior, yet no ablation studies isolate the contribution of dynamic patching versus mixture-of-size encoding versus the spectral-conditioned rotary embeddings. Without these, it remains unclear whether performance gains stem from the proposed mechanisms or from the PreSTS corpus alone.
  Authors: We agree that component-wise ablations would make the source of the observed gains more transparent. Although the primary experiments already compare Kairos against other models trained on the identical PreSTS corpus, we did not include isolated ablations in the original submission. In the revised version, we have added a new set of controlled ablation experiments in Section 5.1. These train and evaluate four variants under identical optimization and data conditions: (i) static patching with standard positional embeddings, (ii) dynamic patching alone, (iii) dynamic patching plus mixture-of-size encoding, and (iv) the complete model including spectral-conditioned embeddings. The new results (reported in an additional table) demonstrate incremental improvements attributable to each architectural element beyond the corpus itself. We have also updated the surrounding text to highlight this controlled comparison.
  Revision: yes
- Referee: [§4.3] The multi-granularity positional embedding conditions on spectral features induced by dynamic patching. The paper should provide a parameter breakdown (or equation) showing that the additional conditioning projections add negligible or zero trainable parameters relative to standard rotary embeddings; otherwise the efficiency advantage over baselines is not established.
  Authors: Thank you for this request for explicit verification. The conditioning is realized by modulating the base rotary frequencies with a lightweight function of the instance-level spectral features; the required linear projection reuses the existing model embedding matrix and adds only a small bias vector whose size equals the hidden dimension. In the revised Section 4.3 we now include the precise mathematical formulation (updated Equation 4) together with a parameter-breakdown table that compares the total trainable parameters of the conditioned embedding against standard RoPE. The table confirms that the net increase is negligible (well under 0.1% of total model parameters) and does not erode the efficiency advantage relative to prior TSFMs.
  Revision: yes
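The patch-size heuristic described in the authors' first response above (density from the standard deviation of first-order differences, mapped to a fixed set of patch sizes, with coverage-proportional mixture weights) can be sketched roughly as follows. This is a minimal illustration of that verbal description only, not the paper's implementation: the candidate patch sizes, the window length, the quantile-based mapping from density to patch size, and both function names are assumptions.

```python
import numpy as np

PATCH_SIZES = (8, 16, 32)  # hypothetical candidate set, not from the paper


def choose_patch_sizes(series, window=32, patch_sizes=PATCH_SIZES):
    """Assign one patch size per non-overlapping window, with no learned parameters."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) // window
    if n_windows == 0:
        return []
    windows = series[: n_windows * window].reshape(n_windows, window)
    # Local information density: std of first-order differences in each window.
    density = np.std(np.diff(windows, axis=1), axis=1)
    # Assumed mapping: split windows into density bins using quantile edges.
    inner_quantiles = np.linspace(0.0, 1.0, len(patch_sizes) + 1)[1:-1]
    edges = np.quantile(density, inner_quantiles)
    bins = np.searchsorted(edges, density, side="left")  # 0 = calmest bin
    # Calm windows get coarse patches; busier windows get finer ones.
    coarse_to_fine = sorted(patch_sizes, reverse=True)
    return [coarse_to_fine[b] for b in bins]


def coverage_weights(chosen, patch_sizes=PATCH_SIZES):
    """Mixture weights proportional to the share of windows using each patch size."""
    if not chosen:
        raise ValueError("no windows were assigned a patch size")
    counts = np.array([sum(1 for c in chosen if c == p) for p in patch_sizes], dtype=float)
    return counts / counts.sum()
```

Because every step here is a fixed statistic or a lookup, the sketch adds no trainable weights, which is the property the response claims. Whether the paper's tokenizer actually works this way would need to be checked against the promised pseudocode and parameter-count table.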
Circularity Check
No circularity detected in architectural proposal or performance claims
The paper proposes new mechanisms (dynamic patching tokenizer, mixture-of-size encoding, spectral-conditioned rotary embeddings) to decouple temporal heterogeneity without increasing model width or depth. These are presented as novel designs trained on the PreSTS corpus and evaluated zero-shot on external benchmarks GIFT-Eval and Time-Series-Library. No equations, parameters, or results are shown to reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on independent architectural choices and empirical results rather than renaming or importing uniqueness from prior self-work.
Axiom & Free-Parameter Ledger
axioms (1)
- Standard math: standard transformer attention and embedding mechanisms can be adapted for time series via patching and positional encodings.
invented entities (4)
- Dynamic patching tokenizer: no independent evidence
- Mixture-of-size encoding: no independent evidence
- Multi-granularity positional embedding: no independent evidence
- PreSTS corpus: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Mixture-of-Size Dynamic Patching (MoS-DP) ... Dynamic Patch Router ... finest patch size $p_k = \min\{p_i \mid g_{n,i} > 0\}$
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Instance-Adaptive Rotary Position Embedding (IARoPE) ... $\theta'_j = \gamma_j \odot \theta_{\mathrm{init},j} + \beta_j$ ... FFT low-frequency components
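The two quoted passages contain formulas that can be illustrated with a short sketch: the finest active patch size, min{p_i | g_{n,i} > 0}, and the instance-adaptive rotary frequencies, θ′ = γ ⊙ θ_init + β. The excerpts elide how the gates g and the modulation terms γ and β are produced, so the linear map from FFT features used below, the choice of FFT magnitudes as features, and all function names are assumptions made for illustration only.

```python
import numpy as np


def finest_patch_size(patch_sizes, gates):
    """Smallest patch size whose gate value is strictly positive."""
    active = [p for p, g in zip(patch_sizes, gates) if g > 0]
    if not active:
        raise ValueError("no patch size has a positive gate")
    return min(active)


def rope_base_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies: base ** (-2j / head_dim) for j = 0 .. head_dim/2 - 1."""
    j = np.arange(head_dim // 2)
    return base ** (-2.0 * j / head_dim)


def low_frequency_features(series, n_freq):
    """Magnitudes of the first n_freq non-DC FFT components (assumed feature choice)."""
    spectrum = np.abs(np.fft.rfft(np.asarray(series, dtype=float)))
    return spectrum[1 : 1 + n_freq]


def adapt_frequencies(theta_init, features, w_gamma, w_beta):
    """Apply theta' = gamma * theta_init + beta elementwise.

    gamma and beta come from a hypothetical linear map of the spectral features;
    the excerpt does not say how they are actually computed.
    """
    gamma = 1.0 + features @ w_gamma  # scale, equal to 1 when w_gamma is zero
    beta = features @ w_beta          # shift, equal to 0 when w_beta is zero
    return gamma * theta_init + beta
```

With both weight matrices set to zero the adapted frequencies reduce to the standard RoPE frequencies, so in this sketch the conditioning acts as a per-instance perturbation around the usual encoding.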
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- TempusBench: An Evaluation Framework for Time-Series Forecasting
  TempusBench is a new evaluation framework for time-series forecasting models that supplies fresh non-overlapping datasets, tasks beyond horizon and domain, consistent tuning across models, and visualization tools.
- WaveMoE: A Wavelet-Enhanced Mixture-of-Experts Foundation Model for Time Series Forecasting
  WaveMoE uses a dual-path architecture with aligned time-series and wavelet tokens routed through shared experts to improve forecasting performance on diverse benchmarks.