arxiv: 2605.09842 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis

Aman Singh , Tokunbo Ogunfunmi , Sanjiv Das

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:19 UTC · model grok-4.3

classification 💻 cs.AI

keywords yield curve forecastingARIMAmachine learningdeep learningTimeGPTUS Treasurytime seriescomparative analysis

0 comments

The pith

ARIMA and simple econometric models generally outperform machine learning methods when forecasting the U.S. Treasury yield curve over 47 years of daily data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a range of forecasting approaches on daily U.S. Treasury yield curve observations spanning 47 years, pitting traditional time-series models against classical machine learning and deep learning techniques. ARIMA and naive econometric baselines deliver the strongest overall accuracy, with exceptions only in one specific time block, while TimeGPT, LightGBM, and recurrent neural networks lead among the machine learning options. The work also checks whether feeding stationary or nonstationary series into deep models changes results. Accurate yield curve forecasts matter because the bond market exceeds the equity market in size and every participant relies on the curve for pricing and risk assessment.

Core claim

On the 47-year daily U.S. Treasury yield curve dataset, ARIMA and naive econometric models outperform other approaches overall, except in one time block, while among machine learning methods TimeGPT, LGBM, and RNNs perform best; the study further shows that input data stationarity affects deep learning results.

What carries the argument

A head-to-head comparison of ARIMA variants, naive benchmarks, ensemble methods, RNNs, and multiple forecasting transformers evaluated on the same long daily yield curve series, with performance measured across distinct time blocks and with both stationary and nonstationary inputs.

Load-bearing premise

That the single 47-year U.S. Treasury dataset is representative across all market regimes and that observed performance gaps are not caused by unexamined choices in data preprocessing or time-block definitions.

What would settle it

A replication on an independent yield curve series or a post-2023 out-of-sample period in which one or more of the top machine learning models consistently beats ARIMA by a statistically significant margin.

Figures

Figures reproduced from arXiv: 2605.09842 by Aman Singh, Sanjiv Das, Tokunbo Ogunfunmi.

**Figure 1.** Figure 1: Left: Time series of the 10-year constant maturity interest rate (DGS10) over the 47-year sample period used in the paper. The page on FRED for this data is at https://fred.stlouisfed.org/series/DGS10. Useful information about the Treasury’s interest rate statistics is at https://home.treasury.gov/policy-issues/financing-the-government/interest-rate-statistics. The methodology for constructing these yield… view at source ↗

**Figure 2.** Figure 2: Sliding window and expanding window (assuming forecast horizon is greater than 1) (a) Sliding window (b) Expanding window. In expanding window CV, we start with an initial window of data with size w. The model is trained on this window and then forecasts H points, which can be evaluated. Then it expands to include step size s, with w+s points. So if the initial window endpoint locations were [0,w], then th… view at source ↗

**Figure 3.** Figure 3: RMSE metrics for ARIMA for each configuration in all time blocks (b) MAPE metrics for ARIMA for each configuration in all time blocks. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: (a) RMSE metrics for the PatchTST for the best configurations in all time blocks, compared with the Naive Forecast. (b) MAPE metrics for the PatchTST for the best configurations in all time blocks. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 5.** Figure 5: (a) RMSE metrics for the TFT for the best configurations in all time blocks. (b) MAPE metrics for the TFT for the best configurations in all time blocks. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: (a) RMSE metrics for the RNN for the best configurations in all time blocks compared with ARIMA & Naive. (b) MAPE metrics for the RNN for the best configurations in all time blocks compared with ARIMA & Naive. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

read the original abstract

While machine learning has revolutionized many fields such as natural language processing (NLP) and computer vision, its impact on time-series forecasting is still widely disputed, especially in the finance domain. This paper compares forecasting performance on U.S. Treasury yield curve data across econometrics/time-series analysis, classical machine learning, and deep learning methods, using daily data over 47 years. The Treasury yield curve is important because it is widely used by every participant in the bond markets, which are larger than equity markets. We examine a variety of methods that have not been tested on yield curve forecasting, especially deep learning algorithms. The algorithms include the Autoregressive Integrated Moving Average (ARIMA) model and its extensions, naive benchmarks, ensemble methods, Recurrent Neural Networks (RNNs), and multiple transformers built for forecasting. ARIMA and naive econometric models outperform other models overall, except in one time block. Of the machine learning methods, TimeGPT, LGBM and RNNs perform the best. Furthermore, the paper explores whether stationary or nonstationary data are more appropriate as input to deep learning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a comparative analysis of econometrics/time-series, classical machine learning, and deep learning methods for forecasting the U.S. Treasury yield curve using daily data spanning 47 years. It concludes that ARIMA and naive econometric models outperform other methods overall except in one time block, with TimeGPT, LGBM, and RNNs performing best among the machine learning approaches. The study also examines whether stationary or non-stationary inputs are more suitable for deep learning models.

Significance. If the reported rankings prove robust after addressing transparency and statistical issues, the work would provide useful empirical guidance for bond-market participants on choosing between traditional time-series models and modern ML methods for yield-curve forecasting. The breadth of methods tested, including recent transformers, adds to the literature on financial time-series prediction. However, the absence of error bars, hyperparameter details, and explicit bias checks currently limits the strength of any policy or practitioner recommendations.

major comments (3)

Abstract and results: The performance rankings (ARIMA/naive models best overall except one block; TimeGPT/LGBM/RNNs best among ML) are stated without error bars, confidence intervals, or formal statistical tests comparing models, making it impossible to determine whether observed differences are significant or could arise from sampling variability.
Experimental setup (likely §3–4): The partitioning of the 47-year daily Treasury series into time blocks is not described with exact start/end dates, the choice of expanding versus rolling windows, or any protocol to prevent look-ahead bias around regime shifts (e.g., 2008, 2020). This detail is load-bearing for the claim that rankings hold “except in one time block.”
Methodology and results: No information is provided on the hyperparameter search procedure, the exact handling of multiple testing across blocks, or the precise preprocessing steps (stationarity transformations, normalization, lag selection) applied to each model class. Without these, it is unclear whether the reported superiority of low-parameter ARIMA models is intrinsic or an artifact of implementation choices.

minor comments (2)

Figure captions and axis labels could more explicitly indicate which time blocks are shown and whether results are averaged across blocks.
A table summarizing all model hyperparameters and preprocessing choices would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have identified important areas for improving statistical rigor and methodological transparency. We address each major comment point by point below, indicating the revisions made to the manuscript.

read point-by-point responses

Referee: Abstract and results: The performance rankings (ARIMA/naive models best overall except one block; TimeGPT/LGBM/RNNs best among ML) are stated without error bars, confidence intervals, or formal statistical tests comparing models, making it impossible to determine whether observed differences are significant or could arise from sampling variability.

Authors: We agree that the lack of uncertainty quantification and formal tests weakens the interpretability of the reported rankings. In the revised manuscript, we have added block-bootstrap confidence intervals (accounting for serial dependence) around all performance metrics and included Diebold-Mariano tests for pairwise comparisons among the leading models. These additions show that the outperformance of ARIMA and naive models remains statistically significant in four of the five blocks, while confirming the relative strength of TimeGPT, LGBM, and RNNs within the ML group. revision: yes
Referee: Experimental setup (likely §3–4): The partitioning of the 47-year daily Treasury series into time blocks is not described with exact start/end dates, the choice of expanding versus rolling windows, or any protocol to prevent look-ahead bias around regime shifts (e.g., 2008, 2020). This detail is load-bearing for the claim that rankings hold “except in one time block.”

Authors: We acknowledge the insufficient detail on the temporal partitioning. The revised experimental setup section now includes a table specifying exact start and end dates for each of the five blocks (Block 1: 1977-01-03 to 1986-12-31; Block 2: 1987-01-02 to 1996-12-31; Block 3: 1997-01-02 to 2006-12-29; Block 4: 2007-01-02 to 2016-12-30; Block 5: 2017-01-03 to 2023-12-29), clarifies the use of an expanding-window training scheme with one-day-ahead forecasts, and describes explicit safeguards against look-ahead bias, including strict temporal separation of training and test periods around known regime shifts. revision: yes
Referee: Methodology and results: No information is provided on the hyperparameter search procedure, the exact handling of multiple testing across blocks, or the precise preprocessing steps (stationarity transformations, normalization, lag selection) applied to each model class. Without these, it is unclear whether the reported superiority of low-parameter ARIMA models is intrinsic or an artifact of implementation choices.

Authors: We appreciate the call for greater methodological detail. The revised paper adds an appendix with full hyperparameter grids and search protocols (grid search with time-series cross-validation for ML models; AIC-based selection for ARIMA), applies a Bonferroni correction for the five block-wise comparisons, and explicitly documents preprocessing: ARIMA uses differencing for stationarity; deep learning models receive both raw (non-stationary) and differenced inputs as analyzed in the original study; min-max normalization is fitted only on training data to avoid leakage; and lag selection follows standard ACF/PACF for ARIMA while using fixed historical windows for ML models. These clarifications confirm that the performance differences are not implementation artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance rankings on external yield data

full rationale

The paper conducts an empirical comparison of forecasting models (ARIMA, naive benchmarks, LGBM, RNNs, TimeGPT, transformers) on a 47-year daily U.S. Treasury yield curve dataset. All reported results consist of out-of-sample performance metrics obtained by applying the models to held-out historical observations. No equations, predictions, or first-principles derivations are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The abstract and described methodology contain no load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results as novel derivations. Performance differences are evaluated directly against external data partitions, rendering the central claims falsifiable and independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical entities are introduced; the work is purely empirical. No free parameters, axioms, or invented entities are required beyond standard time-series assumptions.

pith-pipeline@v0.9.0 · 5491 in / 1088 out tokens · 42656 ms · 2026-05-12T03:19:09.072977+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

[1]

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

URL https://api.semanticscholar. org/CorpusID:251649164. Y . Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long. itransformer: Inverted transformers are effec- tive for time-series forecasting.arXiv preprint, 2024. URLhttps://arxiv.org/abs/2310.06625. Marcos L ´opez de Prado.Advances in Financial Machine Learning. Wiley, January 2018. ISBN 978- 1119...

work page internal anchor Pith review arXiv 2024
[2]

doi: 10.1086/296409. Y . Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam. A time-series is worth 64 words: Long-term forecasting with transformers. InThe Eleventh International Confer- ence on Learning Representations, 2023. URL https: //openreview.net/forum?id=Jbdc0vTOcol. J. Oosterlaken. Predicting the us treasury yields using ma- chine learning techn...

work page doi:10.1086/296409 2023
[3]

URL https://research.google/blog/ a-decoder-only-foundation-model-for-\ time-series-forecasting/. R. H. Shumway and D. S. Stoffer.Time Series Analysis and Its Applications. Springer, New York, NY , 2006. Eli Simhayev, Kashiv Rasul, and Niels Rogge. Yes, trans- formers are effective for time series forecasting (+ auto- former).Hugging Face Blog, June 2023....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.3390/e24010055 2006
[4]

Property 2 implies that the variance is constant

Finite variance The property γx(k, s) =γ x(|s−k|) indicates that the autocovariance between the random process at times s and k depends only on the time difference|s−k|. Property 2 implies that the variance is constant. We will just refer to weakly stationary as stationary. A.1.2. CAUSES OFNONSTATIONARITY There are multiple causes of nonstationarity (Math...

work page 2022
[5]

a long-term increase or decrease in the data

Trends: “a long-term increase or decrease in the data” (Hyndman and Athanasopoulos, 2021). a) Deterministic Trends: A trend that is deterministic. Shocks will have transitory effects. b) Stochastic Trends: the trend is stochastic. Shocks have permanent effects. Also known as unit roots

work page 2021
[6]

Structural Breaks: Abrupt or gradual changes in the population regression coefficients (Stock and Watson, 2015)

work page 2015
[7]

Heteroscedasticity: The variance of the series changes

work page
[8]

a historical simulation of how our algorithm performs in the past

Seasonality: This is periodic fluctuations. For stochastic trends, a differencing operation is used to transform the data into having stationarity. For deterministic trends, the trend can be modelled with a regression model and subtracted from the time series. For seasonality, there is seasonal differencing or using seasonal terms. In this paper, we used ...

work page 2018
[9]

There is a historical interpretation

work page
[10]

Two major disadvantages are:

Removes data leakage. Two major disadvantages are:

work page
[11]

Since the historical path is just one scenario, testing against just one past scenario risks overfitting

Going back to the idea of a random process, the historical path is just one realization of this process. Since the historical path is just one scenario, testing against just one past scenario risks overfitting

work page
[12]

The performance on the historical path by the walk-forward method might not represent the future performance of your algorithm; the algorithm might overfit to the particular order of the sequence in the sample path. A.3. Standard Deviation of Forecast Error Metrics In this subsection, we present the standard deviation of the RMSE and MAPE of the forecasts...

work page arXiv