Multivariate Financial Forecasting using the Chronos Time Series Foundation Models

Mohini Yadav; Sanjiv R Das; Tarang Goyal

arxiv: 2605.21504 · v1 · pith:CEPNGLIRnew · submitted 2026-05-08 · 💱 q-fin.ST · cs.AI


Pith reviewed 2026-05-22 02:15 UTC · model grok-4.3


keywords: multivariate forecasting · time series foundation models · Chronos · financial forecasting · interest rates · equities · forecast accuracy · rolling evaluation


The pith

Multivariate inputs from related series improve forecast accuracy over univariate baselines in the Chronos time-series foundation model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether feeding multiple related financial time series into the Chronos-2 foundation model produces more accurate forecasts than treating each series in isolation. It applies the model to panels of Magnificent-7 equities and U.S. Treasury interest rates, plus their combination, with rolling monthly out-of-sample evaluations spanning 2000 to 2025. Multivariate forecasts reduce RMSE and MAPE relative to univariate ones, with larger gains on interest rates and lower error dispersion across all series. Accuracy falls when equities and rates are mixed together, showing that the benefit depends on choosing coherent groups of series rather than adding arbitrary context.

Core claim

Multivariate forecasts with Chronos-2 consistently outperform univariate forecasts across the equity panel, the interest-rate panel, and the full set of series. Gains appear in every individual series examined, error dispersion declines under multivariate inputs, and the largest improvements occur for interest rates. Accuracy degrades when equities and rates are modeled jointly, indicating that cross-market mixing introduces noise that harms performance.

What carries the argument

Rolling monthly forecast evaluation of Chronos-2 under multivariate versus univariate input regimes, with varying window lengths and horizons, measured by RMSE and MAPE.
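As a point of reference for the two error measures named above, here is a minimal Python sketch of RMSE and MAPE. The function names and the sample numbers are illustrative assumptions; they are not taken from the paper or its notebook.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error over paired observations."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, so that case is rejected.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up numbers for illustration only (not results from the paper).
actual = [100.0, 102.0, 101.0, 105.0]
uv_forecast = [98.0, 103.0, 104.0, 101.0]   # hypothetical univariate forecast
mv_forecast = [99.0, 102.5, 102.0, 103.0]   # hypothetical multivariate forecast

for name, forecast in [("UV", uv_forecast), ("MV", mv_forecast)]:
    print(name, round(rmse(actual, forecast), 4), round(mape(actual, forecast), 4))
```

One design point worth keeping in mind: MAPE divides by the actual value, so it can inflate sharply when actual values sit close to zero. That may matter for low-yield periods in an interest-rate panel, which is one reason to read RMSE and MAPE side by side.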

Load-bearing premise

The rolling monthly evaluation protocol with fixed window lengths and horizons produces unbiased out-of-sample comparisons between multivariate and univariate regimes without look-ahead bias or post-hoc selection of favorable periods.

What would settle it

A new test set or replication period in which univariate forecasts achieve lower or equal RMSE and MAPE than multivariate forecasts under identical rolling-window conditions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.21504 by Mohini Yadav, Sanjiv R Das, Tarang Goyal.

**Figure 1.** Parameter heatmap comparing MV and UV forecasting accuracy across window lengths and horizons (based on MAPE). The upper plot is for Rates and the lower one is for Stocks. In each plot, the left side is for the 21-working-day forecast (1-month forecast) and the right side is for the 63-working-day forecast (3-month forecast). [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]

**Figure 2.** Time series of MV and UV MAPE errors over the evaluation period, for the dataset running from July 2010 through December 2025. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]

**Figure 3.** Comparison of forecast accuracy pre-2023 and post-2023. Data used is for the period July 2010 through December 2025. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

Original abstract

Using Chronos-2, an open-source time-series foundation model, we evaluate pretrained time-series models for economic and financial forecasting with an emphasis on whether multivariate (MV) inputs improve accuracy relative to univariate (UV) baselines. The study covers two panels (the Magnificent-7 equities and U.S. Treasury interest rates) as well as a combined panel, using rolling monthly evaluations from 2000–2025. We vary input window lengths and forecast horizons and report RMSE and MAPE. Across datasets, MV forecasts consistently outperform UV forecasts, with especially strong gains for interest rates and meaningful improvements for equities. Series-level comparisons show MV improvements in every case, and error dispersion is generally lower under MV inputs. We also provide parameter-heatmap and time-series visualizations. However, mixing time series across equity and interest rate markets reduces forecast accuracy, indicating that adding noisy context degrades model performance. Overall, the results indicate that foundation models can leverage cross-series information to improve forecast accuracy in finance, and that the benefits are strongest when related series are modeled jointly under disciplined rolling protocols. Other than using an open-source foundation model, this paper also showcases how AI may be used for financial research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Chronos-2 shows a multivariate edge on these financial panels, but the results rest on point estimates without statistical tests or a documented leakage-free protocol.


This paper's main finding is that multivariate inputs improve Chronos-2 forecasts over univariate ones for both the Magnificent 7 equities and Treasury rates, with the biggest lifts on interest rates and lower error spread overall. The work applies a recently released open-source time series foundation model to these panels using rolling monthly windows from 2000 to 2025. It varies input lengths and horizons, reports RMSE and MAPE, and includes some visualizations. The observation that mixing stocks and bonds hurts accuracy is a solid practical point.

The evidence has clear gaps. The results are point estimates without any statistical tests or confidence intervals, which makes it hard to judge how meaningful the differences are. Details on prompting the foundation model or any fine-tuning are missing. The rolling protocol is called disciplined, but without confirmation that all preprocessing like scaling stays within the training window for each forecast origin, the multivariate gains could partly reflect non-causal handling. That matches the stress-test concern and needs checking.

This kind of paper is useful for quant finance researchers testing foundation models on market data. A reader looking for applied examples rather than new theory would find it worthwhile. It should go to peer review. The comparison is concrete and the topic timely enough for referees to sort out the robustness issues.

Referee Report

3 major / 2 minor

Summary. The paper evaluates the Chronos-2 time-series foundation model for financial forecasting on two panels (Magnificent-7 equities and U.S. Treasury rates) plus a combined panel. Using rolling monthly out-of-sample evaluations from 2000–2025 with varying input windows and horizons, it reports that multivariate (MV) inputs produce lower RMSE and MAPE than univariate (UV) baselines, with larger gains for rates, uniformly positive series-level improvements, and lower error dispersion under MV; mixing equities and rates degrades accuracy.

Significance. If the MV advantage survives rigorous causal validation and statistical testing, the work would demonstrate that open-source time-series foundation models can usefully exploit cross-series dependencies in financial data under disciplined rolling protocols. This would be a concrete, reproducible contribution to multivariate forecasting practice, especially given the use of an open-source model and the explicit comparison of related versus unrelated series.

major comments (3)

[Evaluation Protocol] The rolling monthly evaluation protocol (described in the methods and results sections) does not explicitly confirm that normalization, scaling, or context construction for each forecast origin uses only information up to that origin. If any statistic is computed over the full series or if MV and UV input pipelines differ in their use of future data, the reported RMSE/MAPE gains could be artifacts rather than evidence of genuine cross-series information transfer.
[Results] Across all reported tables and figures, only point estimates of RMSE and MAPE are given; no standard errors, confidence intervals, Diebold-Mariano tests, or other statistical significance assessments accompany the MV–UV differences. Without these, the central claim of “consistent outperformance” rests on untested numerical differences whose economic or statistical reliability cannot be judged.
[Methodology] The manuscript provides insufficient detail on how Chronos-2 is prompted or adapted for the multivariate versus univariate regimes (e.g., tokenization of multiple series, context length handling, or any fine-tuning). This makes it impossible to verify that the input-construction rules are identical and fair across the two settings.
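The leakage concern in the first major comment can be made concrete with a small Python sketch. It contrasts scaling a context window with statistics from that window only against scaling with statistics from the full series. Both function names are hypothetical, and the z-score scaling is an assumed stand-in: whether the paper applies any preprocessing of this kind outside the model is exactly what the comment asks the authors to state.

```python
import statistics


def causal_zscore_context(series, origin, window):
    """Standardize the `window` observations that end just before `origin`.

    Only data strictly before the forecast origin is used for the mean and
    standard deviation, so nothing after the origin can influence the input.
    """
    if window < 1 or origin - window < 0 or origin > len(series):
        raise ValueError("window does not fit before the forecast origin")
    context = series[origin - window:origin]
    mu = statistics.fmean(context)
    sigma = statistics.pstdev(context) or 1.0  # guard against a flat window
    return [(x - mu) / sigma for x in context], mu, sigma


def leaky_zscore_context(series, origin, window):
    """Anti-pattern shown for contrast: statistics come from the full series,
    so observations after `origin` leak into the scaled context."""
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series) or 1.0
    context = series[origin - window:origin]
    return [(x - mu) / sigma for x in context], mu, sigma


# Two toy series that agree up to the origin and differ only afterwards.
past = [1.0, 2.0, 3.0, 4.0]
series_a = past + [100.0]
series_b = past + [-5.0]

print(causal_zscore_context(series_a, 4, 4)[0] == causal_zscore_context(series_b, 4, 4)[0])
print(leaky_zscore_context(series_a, 4, 4)[0] == leaky_zscore_context(series_b, 4, 4)[0])
```

A simple check of this kind (perturb the data after the origin and confirm the model input does not change) could be run identically for the MV and UV pipelines.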

minor comments (2)

[Abstract] The abstract states that “mixing time series across equity and interest rate markets reduces forecast accuracy,” yet the corresponding table or figure is not referenced; adding an explicit cross-reference would improve readability.
[Results] Parameter-heatmap and time-series visualizations are mentioned but their placement (main text vs. appendix) and exact content are unclear; a brief description or caption reference would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and describe the revisions we intend to implement.


Referee: [Evaluation Protocol] The rolling monthly evaluation protocol (described in the methods and results sections) does not explicitly confirm that normalization, scaling, or context construction for each forecast origin uses only information up to that origin. If any statistic is computed over the full series or if MV and UV input pipelines differ in their use of future data, the reported RMSE/MAPE gains could be artifacts rather than evidence of genuine cross-series information transfer.

Authors: We appreciate the referee's emphasis on preventing data leakage. In our implementation, all normalization and scaling statistics are computed exclusively from data available up to each forecast origin within the rolling window. We will revise the methods section to state this explicitly for both univariate and multivariate pipelines and to confirm that the only difference between the two regimes is the inclusion of additional series.

Revision: yes

Referee: [Results] Across all reported tables and figures, only point estimates of RMSE and MAPE are given; no standard errors, confidence intervals, Diebold-Mariano tests, or other statistical significance assessments accompany the MV–UV differences. Without these, the central claim of “consistent outperformance” rests on untested numerical differences whose economic or statistical reliability cannot be judged.

Authors: We agree that statistical assessment of the differences would strengthen the results. In the revision we will report standard errors derived from the rolling windows and add Diebold-Mariano tests comparing MV and UV forecasts for each series and for the panel aggregates.

Revision: yes

Referee: [Methodology] The manuscript provides insufficient detail on how Chronos-2 is prompted or adapted for the multivariate versus univariate regimes (e.g., tokenization of multiple series, context length handling, or any fine-tuning). This makes it impossible to verify that the input-construction rules are identical and fair across the two settings.

Authors: We concur that greater methodological transparency is required. The revised manuscript will contain an expanded subsection detailing the exact prompting format, the tokenization and concatenation procedure for multivariate inputs, context-length management, and the fact that the pretrained Chronos-2 model is used without fine-tuning. This will demonstrate that the MV and UV pipelines differ solely in the presence of cross-series context.

Revision: yes
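For readers unfamiliar with the Diebold-Mariano comparison named in the second exchange, the following Python sketch shows one common form of the statistic under squared-error loss. It is an illustration under stated assumptions, not the authors' planned implementation: the function name is hypothetical, the long-run variance uses the rectangular kernel with `horizon - 1` lags, the p-value relies on a normal approximation, and the error values are made up.

```python
import math


def diebold_mariano(errors_a, errors_b, horizon=1):
    """Diebold-Mariano statistic for equal squared-error accuracy.

    A negative statistic means model A has the lower average squared error.
    Returns (statistic, two_sided_p_value) using a normal approximation.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("error series must have equal length")
    # Loss differential: squared error of A minus squared error of B.
    d = [ea ** 2 - eb ** 2 for ea, eb in zip(errors_a, errors_b)]
    n = len(d)
    if n <= horizon:
        raise ValueError("need more observations than the forecast horizon")
    d_bar = sum(d) / n

    def autocov(lag):
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Long-run variance of the loss differential (rectangular kernel).
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|stat|))
    return stat, p_value


# Made-up forecast errors for illustration only (not results from the paper).
mv_errors = [0.1, -0.2, 0.1, 0.3, -0.1, 0.2]
uv_errors = [1.0, -1.2, 0.9, 1.5, -1.1, 0.8]
stat, p = diebold_mariano(mv_errors, uv_errors, horizon=1)
print(stat < 0, 0.0 <= p <= 1.0)
```

With multi-step forecasts evaluated at overlapping origins, the loss differentials are serially correlated, which is why the `horizon` argument widens the variance estimate. In small samples a finite-sample correction or a different kernel may be preferable; that choice is left open here.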

Circularity Check

0 steps flagged

No circularity: purely empirical rolling-window evaluation

full rationale

The paper conducts an empirical comparison of multivariate versus univariate forecasts generated by the pretrained Chronos-2 foundation model on equity and interest-rate panels. It reports RMSE and MAPE under a fixed rolling monthly protocol from 2000–2025 with no internal parameter fitting, no self-referential definitions, and no derivation that reduces the accuracy numbers to quantities defined by the paper’s own inputs. The central claim rests on out-of-sample forecast errors computed from historical data; these errors are not forced by construction from any fitted quantity or self-citation chain inside the manuscript.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of rolling-window out-of-sample evaluation and on the assumption that the pretrained Chronos-2 weights already encode useful cross-series structure for financial data.

axioms (1)

Domain assumption: Rolling monthly windows with fixed input lengths and horizons produce unbiased out-of-sample forecast comparisons.
Standard time-series practice invoked to justify the evaluation design.
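The rolling-window design that this assumption refers to can be sketched in a few lines of Python. The generator below yields index ranges for each forecast origin, with the context always ending before the target begins. The function name and the example sizes are assumptions; in particular, treating 21 trading days as one monthly step is an approximation borrowed from the horizon convention in Figure 1, not a documented detail of the paper's protocol.

```python
def rolling_splits(n_obs, window, horizon, step):
    """Yield (context_range, target_range) index pairs for rolling-origin evaluation.

    The context covers the `window` observations before each origin and the
    target covers the `horizon` observations from the origin onward, so the
    two ranges never overlap.
    """
    if min(window, horizon, step) < 1:
        raise ValueError("window, horizon, and step must be positive")
    origin = window
    while origin + horizon <= n_obs:
        yield range(origin - window, origin), range(origin, origin + horizon)
        origin += step


# Illustrative sizes: about one year of daily data, a 126-day context,
# a 21-day horizon, and a new forecast origin every 21 days.
splits = list(rolling_splits(n_obs=252, window=126, horizon=21, step=21))
for context, target in splits:
    # Every context index precedes every target index.
    assert context[-1] < target[0]
print(len(splits))
```

Applying the same splits to both the univariate and the multivariate runs keeps the comparison aligned origin by origin, which is also what a paired test on the forecast errors would require.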


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] Conceptualization: Students are provided an initial outline.

[2] Initial Coding: Students prepare a Google Colab notebook with assistance from Gemini.

[3] Refinement: The professor updates the notebook, modularizing the code and updating logic, also utilizing Gemini for coding assistance.

[4] Data Preparation: Results are finalized in Colab, consisting of printed dataframes, plots, and textual outputs.

[5] Environment Setup: A LaTeX shell for the paper is established in OpenAI Prism (https://openai.com/prism/).

[6] Outlining: The structure is set up with blank sections: Introduction, Methodology, Results, and Concluding Discussion (including relevant subsections).

[7] Table Conversion: Printed dataframes from Colab are fed into Gemini with prompts to convert them into LaTeX tables. These are then integrated into the Results section.

[8] Visuals: Figures are added to the project folder, and the corresponding LaTeX code for figure inclusion is written.

[9] Drafting Results: Prism is tasked with drafting text for the Results section to explain the tables and figures. Notably, the underlying model (GPT-5.2) inferred results directly from the tables to generate the write-up.

[10] Methodology: The Chronos-1 and Chronos-2 papers, along with experiment descriptions from the Colab notebook, are uploaded to Gemini. Gemini generates the Methodology section, which is then moved to the draft.

[11] Specific Subsections: A subsection on data leakage is added; the reasoning is provided by the authors while the LLM handles the prose.

[12] Literature Review (Part 1): A BibTeX file from Zotero is uploaded. Prism is prompted to generate a literature review based on the citations as they relate to the Methodology and Results.

[13] Introduction and Literature Review (Part 2): Previously published related papers are uploaded. Prism generates the Introduction and updates the literature review for context.

[14] Conclusions: Using the full paper as context, Prism generates the Concluding Discussion.

[15] Abstract: Prism is prompted to synthesize an abstract for the completed draft.

[16] Human Editing: Final human-led editing passes are performed to improve narrative flow, reformat the LaTeX layout, and ensure technical accuracy. Self Review and Revision Process: Gemini was also utilized to generate a mock referee report (review.pdf). This report informed a response document (responses.md) created with the following prompt: “Please read the...
