KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Kan Ren; Kun Feng; Lintao Ma; Shuqi Gu; Sihan Lu; Xingyu Lu; Yiyang Tan; Yuchen Fang; Ziwei Shan

arxiv: 2605.30002 · v1 · pith:D6SHHIUMnew · submitted 2026-05-28 · 💻 cs.AI

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Kun Feng , Ziwei Shan , Yuchen Fang , Yiyang Tan , Sihan Lu , Shuqi Gu , Lintao Ma , Xingyu Lu

show 1 more author

Kan Ren

This is my paper

Pith reviewed 2026-06-29 07:18 UTC · model grok-4.3

classification 💻 cs.AI

keywords time series forecastinglarge language modelstime series foundation modelsagentic frameworksreinforcement learningzero-shot forecastingmultimodal fusionsemantic reasoning

0 comments

The pith

KairosAgent fuses an LLM reasoner that invokes analytical tools with a TSFM forecaster to achieve superior zero-shot multimodal time series forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes KairosAgent to solve the split between time series foundation models that lack semantic understanding and large language models that struggle with precise numerical forecasting. It creates an agentic system where an LLM reasoner dynamically calls tools to strengthen both numerical grasp and future-oriented semantic reasoning, then merges those outputs into the TSFM pipeline. A large corpus of trajectories trains the system further through reinforcement learning that assigns credit at each turn of multi-turn refinement. The result is presented as higher zero-shot accuracy across domains while using existing pretrained models more effectively and producing more interpretable forecasts.

Core claim

KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, the approach curates a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment.

What carries the argument

The LLM-based reasoner that dynamically invokes analytical tools, whose outputs are fused into the TSFM-based forecaster and refined by RL with turn-level credit assignment on curated trajectories.

Load-bearing premise

Dynamically invoking analytical tools will reliably improve LLM numerical understanding and semantic reasoning without adding new errors, and the RL training on the trajectories will produce generalizable gains instead of overfitting.

What would settle it

A controlled experiment on held-out cross-domain datasets where the tool-augmented LLM reasoning produces lower accuracy or introduces measurable numerical errors compared with the unmodified TSFM baseline.

Figures

Figures reproduced from arXiv: 2605.30002 by Kan Ren, Kun Feng, Lintao Ma, Shuqi Gu, Sihan Lu, Xingyu Lu, Yiyang Tan, Yuchen Fang, Ziwei Shan.

**Figure 2.** Figure 2: Overview of KAIROSAGENT. (a): inference pipeline where the reasoner invokes statistical tools to produce a morphology description r, which is fused into the Kairos decoder to generate numerical forecasts. (b): three-stage training pipeline where SFT initializes tool-calling capabilities, multimodal alignment trains the text-conditioned forecaster, and RL with turn-level credit assignment refines the reason… view at source ↗

**Figure 3.** Figure 3: Overview of the T-STAR corpus, illustrating its domain distribution and generation pipeline. We construct T-STAR, a tool-augmented corpus comprising over 40k validated examples drawn from 29 datasets across nine domains ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Performance comparison (MAE) on TimeIMM across irregular multimodal time series forecasting tasks, with complete results reported in Appendix F. Irregular Forecasting Evaluation [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Tool usage in T-STAR reasoning trajectories. (a) Per-turn distribution: Trend and volatility dominate early turns, shifting toward extrema, distribution, and frequency later. (b) Per-dataset distribution: Tool selection adapts to dataset-specific temporal properties. Values denote column-normalized tool call percentages. Additional seed-stability and inference-efficiency analyses are reported in Appendices… view at source ↗

**Figure 6.** Figure 6: Morphology prior improves Time-IMM fore [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KairosAgent sketches a hybrid LLM-TSFM agent with tool use and RL refinement, but the abstract supplies no experiments or numbers so the performance claims can't be checked.

read the letter

The main takeaway is that this paper puts forward KairosAgent as an agent framework that pairs an LLM reasoner with a TSFM forecaster, lets the LLM call analytical tools to improve its numerical grip, fuses the output into the forecaster, and then runs RL on curated trajectories with multi-turn refinement and turn-level credit assignment.

What is actually new is the specific pipeline that treats forecasting as a multi-turn agent task and applies RL directly to the reasoning traces rather than just fine-tuning the models. The motivation section does a clean job laying out why standalone TSFMs miss semantics and standalone LLMs miss accurate numbers.

The soft spot is obvious and central: the abstract asserts superior zero-shot performance across domains but gives no baselines, no datasets, no metrics, and no error analysis. The stress-test concern about the RL stage overfitting to the trajectory corpus lands because nothing is said about how diverse those trajectories are or whether hold-out domains were used. Without that evidence the fusion claim stays untested.

The paper is aimed at people already working on multimodal or agentic time series methods in applied areas like finance or healthcare. A reader hunting for architectural ideas might pull a useful diagram or two from it.

It does not look ready for peer review. The framework description is coherent on its own terms, but the load-bearing claim of better forecasting needs the actual results and validation details before any referee should spend time on it.

Referee Report

3 major / 2 minor

Summary. The paper proposes KairosAgent, an agentic framework for cross-domain multimodal time series forecasting that pairs an LLM-based reasoner with a TSFM-based forecaster. It dynamically invokes analytical tools to bolster LLM numerical comprehension and semantic reasoning, fuses those outputs into the TSFM pipeline, and refines the reasoner via RL on a curated corpus of trajectories using multi-turn refinement and turn-level credit assignment. The central claim is that this yields superior zero-shot forecasting performance while better exploiting pretrained LLMs and TSFMs.

Significance. If the empirical claims hold after proper validation, the work would be significant for demonstrating a practical route to fuse semantic reasoning with numerical forecasting without retraining foundation models from scratch. The emphasis on tool use, trajectory curation, and RL credit assignment could inform future agentic designs in time-series domains.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: no baselines, datasets, metrics (e.g., MAE, MSE), error bars, or statistical tests are reported to substantiate the 'superior zero-shot forecasting performance' claim; without these the central empirical assertion cannot be evaluated.
[RL from Forecasting Paradigm] RL from Forecasting section (trajectory curation and RL stage): the manuscript supplies no information on trajectory source diversity, domain coverage, or hold-out splits used for the RL corpus, leaving open the possibility that reported gains reflect overfitting to the training distribution rather than generalizable semantic-numerical fusion.
[Framework Architecture] Fusion mechanism description: the precise interface by which LLM reasoning outputs are injected into the TSFM (e.g., as additional tokens, prompts, or conditioning vectors) is not formalized, making reproducibility of the claimed multimodal fusion impossible to assess.

minor comments (2)

[RL from Forecasting Paradigm] Notation for the turn-level credit assignment function is introduced without an explicit equation or pseudocode.
[Abstract] The project page URL is given but no statement on code or model release appears in the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to a revised manuscript that incorporates the requested details and clarifications.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: no baselines, datasets, metrics (e.g., MAE, MSE), error bars, or statistical tests are reported to substantiate the 'superior zero-shot forecasting performance' claim; without these the central empirical assertion cannot be evaluated.

Authors: We agree that the current abstract and Experiments section lack the necessary specifics. In the revision we will expand both sections to report the full experimental protocol: the complete set of baselines (including representative TSFMs and LLM-based methods), the datasets and their characteristics, evaluation metrics (MAE, MSE, and others), error bars computed over multiple random seeds, and appropriate statistical tests (e.g., paired t-tests or Wilcoxon tests) with p-values. These additions will allow direct evaluation of the zero-shot performance claims. revision: yes
Referee: [RL from Forecasting Paradigm] RL from Forecasting section (trajectory curation and RL stage): the manuscript supplies no information on trajectory source diversity, domain coverage, or hold-out splits used for the RL corpus, leaving open the possibility that reported gains reflect overfitting to the training distribution rather than generalizable semantic-numerical fusion.

Authors: We acknowledge the omission. The revised manuscript will include a dedicated subsection detailing the trajectory corpus: its construction from diverse sources across multiple time-series domains, the breadth of domain coverage, and the explicit train/hold-out splits used for RL training versus evaluation. This information will clarify that the reported improvements arise from generalizable fusion rather than overfitting. revision: yes
Referee: [Framework Architecture] Fusion mechanism description: the precise interface by which LLM reasoning outputs are injected into the TSFM (e.g., as additional tokens, prompts, or conditioning vectors) is not formalized, making reproducibility of the claimed multimodal fusion impossible to assess.

Authors: We agree that the fusion interface requires formalization. In the revision we will add a precise mathematical and algorithmic description of the injection mechanism, specifying whether LLM outputs are mapped to additional tokens, prompt prefixes, or explicit conditioning vectors, together with the exact dimensionality and integration point inside the TSFM pipeline. This will make the multimodal fusion fully reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: framework builds on external pretrained models with empirical claims

full rationale

The paper proposes KairosAgent as an agentic system that combines an LLM reasoner with a TSFM forecaster, using dynamic tool invocation for numerical/semantic fusion and RL on curated trajectories for refinement. All load-bearing elements (tool use, fusion, RL with turn-level credit assignment) are described as engineering choices applied to existing pretrained models rather than derived from the paper's own fitted outputs or self-referential definitions. No equations, uniqueness theorems, or 'predictions' are presented that reduce by construction to inputs. The zero-shot performance claims are empirical and rest on experiments, not on any self-definitional loop or fitted-input renaming. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Based solely on the abstract, the central addition is the agentic framework itself; no explicit free parameters, axioms, or invented entities beyond the named components are detailed.

axioms (2)

domain assumption Pretrained LLMs can perform enhanced semantic reasoning when augmented with analytical tools
Invoked as the basis for the reasoner component
domain assumption TSFMs can incorporate fused semantic reasoning outputs to improve numerical forecasts
Invoked as the basis for the forecaster component

invented entities (1)

KairosAgent no independent evidence
purpose: Unify textual reasoning and numerical forecasting via tool invocation and RL refinement
The framework is introduced as the core contribution

pith-pipeline@v0.9.1-grok · 5788 in / 1377 out tokens · 29526 ms · 2026-06-29T07:18:45.621248+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Gluonts: Probabilistic and neural time series modeling in python.Journal of Machine Learning Research, 21(116):1–6. Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Olek- sandr Shchur, Syama Sundar Rangapuram, Sebas- tian Pineda Arango, Shubham Kapoor, and 1 others
[2]

Conversational time series foundation models: Towards explainable and effective forecasting

Chronos: Learning the language of time series. Transactions on Machine Learning Research. Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. 2026. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning.Ad- vances in Neural Information Processing Systems, 38:57529–57580....

work page arXiv 2026
[3]

Monash Time Series Forecasting Archive.arXiv preprint arXiv:2105.06643, 2021

Monash time series forecasting archive.arXiv preprint arXiv:2105.06643. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang,...

work page arXiv 2024
[4]

TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Timeomni-1: Incentivizing complex reason- ing with time series in large language models.arXiv preprint arXiv:2509.24803. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

Chatts: Aligning time series with llms via syn- thetic data for enhanced understanding and reasoning. Proceedings of the VLDB Endowment, 18(8):2385– 2398. Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, and Yan Liu. 2024. Domain-oriented time series inference agents for reasoning and automated analysis.arXiv preprint arXiv:2410.04047. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

# Samples

Are transformers effective for time series fore- casting? InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121– 11128. Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, and Tianrui Li. 2015. Forecast- ing fine-grained air quality based on big data. In Proceedings of the 21th ACM SIGKDD international c...

2015
[7]

Violating sam- ples are regenerated with a stricter prompt constraint

Metadata Leak Check.A judge verifies that the final forecast does not contain exact numbers, timestamps, dataset names, variable names, units, or domain labels. Violating sam- ples are regenerated with a stricter prompt constraint
[8]

Reasoning Usage Check.A judge assesses whether the reasoning trace meaningfully in- corporates tool outputs and metadata context, rather than producing a generic answer
[9]

In the short term,

Forecast Accuracy Check.A judge evaluates whether the predicted morphology is broadly consistent with the held-out future values in terms of trend, periodicity, volatility, regime changes, and turning points. Samples failing any check are retried up to 3 times. Samples that remain invalid after all retries are discarded. This pipeline ensures that retaine...

2024
[10]

call_id":

and TimeCMA (Liu et al., 2025a) construct template-based prompts from the numerical history. CALF (Liu et al., 2025b) uses embedding-level textual tokens extracted from the pretrained LLM vocabulary space. Zero-shot Multimodal Foundation Model.For Aurora (Wu et al., 2025b), we use the textual prompts provided in its original paper. Zero-shot TSFMs and Uni...

work page arXiv 2024

[1] [1]

Gluonts: Probabilistic and neural time series modeling in python.Journal of Machine Learning Research, 21(116):1–6. Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Olek- sandr Shchur, Syama Sundar Rangapuram, Sebas- tian Pineda Arango, Shubham Kapoor, and 1 others

[2] [2]

Conversational time series foundation models: Towards explainable and effective forecasting

Chronos: Learning the language of time series. Transactions on Machine Learning Research. Andreas Auer, Patrick Podest, Daniel Klotz, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. 2026. Tirex: Zero-shot forecasting across long and short horizons with enhanced in-context learning.Ad- vances in Neural Information Processing Systems, 38:57529–57580....

work page arXiv 2026

[3] [3]

Monash Time Series Forecasting Archive.arXiv preprint arXiv:2105.06643, 2021

Monash time series forecasting archive.arXiv preprint arXiv:2105.06643. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang,...

work page arXiv 2024

[4] [4]

TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Timeomni-1: Incentivizing complex reason- ing with time series in large language models.arXiv preprint arXiv:2509.24803. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, and 1 others. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

Chatts: Aligning time series with llms via syn- thetic data for enhanced understanding and reasoning. Proceedings of the VLDB Endowment, 18(8):2385– 2398. Wen Ye, Wei Yang, Defu Cao, Yizhou Zhang, Lumingyuan Tang, Jie Cai, and Yan Liu. 2024. Domain-oriented time series inference agents for reasoning and automated analysis.arXiv preprint arXiv:2410.04047. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

# Samples

Are transformers effective for time series fore- casting? InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121– 11128. Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, and Tianrui Li. 2015. Forecast- ing fine-grained air quality based on big data. In Proceedings of the 21th ACM SIGKDD international c...

2015

[7] [7]

Violating sam- ples are regenerated with a stricter prompt constraint

Metadata Leak Check.A judge verifies that the final forecast does not contain exact numbers, timestamps, dataset names, variable names, units, or domain labels. Violating sam- ples are regenerated with a stricter prompt constraint

[8] [8]

Reasoning Usage Check.A judge assesses whether the reasoning trace meaningfully in- corporates tool outputs and metadata context, rather than producing a generic answer

[9] [9]

In the short term,

Forecast Accuracy Check.A judge evaluates whether the predicted morphology is broadly consistent with the held-out future values in terms of trend, periodicity, volatility, regime changes, and turning points. Samples failing any check are retried up to 3 times. Samples that remain invalid after all retries are discarded. This pipeline ensures that retaine...

2024

[10] [10]

call_id":

and TimeCMA (Liu et al., 2025a) construct template-based prompts from the numerical history. CALF (Liu et al., 2025b) uses embedding-level textual tokens extracted from the pretrained LLM vocabulary space. Zero-shot Multimodal Foundation Model.For Aurora (Wu et al., 2025b), we use the textual prompts provided in its original paper. Zero-shot TSFMs and Uni...

work page arXiv 2024