Sequential Structure in Intraday Futures Data: LSTM vs Gradient Boosting on MNQ

Mathias Mesfin

arXiv: 2605.17724 · v1 · pith:DBF4MXHY · submitted 2026-05-18 · 💱 q-fin.TR · cs.LG · q-fin.CP · q-fin.ST

Pith reviewed 2026-05-19 22:00 UTC · model grok-4.3

classification 💱 q-fin.TR · cs.LG · q-fin.CP · q-fin.ST

keywords intraday forecasting · LSTM · gradient boosting · futures · MNQ · sequential models · walk-forward validation · data requirements

The pith

Four years of single-instrument five-minute OHLCV data prove insufficient for reliable intraday ML forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper compares LSTM networks and gradient boosting models for predicting whether the session close of Micro E-Mini Nasdaq 100 (MNQ) futures exceeds the 10:30 AM open by more than ten points. The models are trained and tested on four years of five-minute price bars using strict walk-forward validation that respects temporal order. All configurations produce out-of-sample accuracies between 50 and 51 percent, below the 51.8 percent base rate and not statistically significant under permutation tests. The author interprets the failure to find signal as evidence that single-instrument datasets of this size lack enough sequential structure for these architectures to exploit.

Core claim

The central finding is that neither LSTM nor gradient boosting extracts statistically significant predictive information from sequences of five-minute OHLCV bars when the training corpus is limited to 944 trading days of a single futures contract. Out-of-sample accuracies range from 50.00 percent to 50.89 percent against a 51.8 percent base rate, with permutation-test p-values of 0.135 for the best gradient boosting model and 0.515 for the LSTM. Feature importances shift across successive validation folds, consistent with noise fitting rather than capture of stable market structure. The work therefore supplies an empirical lower bound on the data volume required for sequential models to succeed at this task.
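A label-permutation test of the kind referenced above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the function names, the number of permutations, the one-sided formulation, and the synthetic labels are all assumptions made here.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_p_value(y_true, y_pred, n_permutations=2000, seed=0):
    """One-sided p-value: share of label shuffles whose accuracy is at
    least as high as the accuracy observed on the real labels."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between labels and predictions
        if accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Synthetic data for illustration only (not the paper's data).
rng = random.Random(42)
labels = [rng.random() < 0.518 for _ in range(400)]
noise_preds = [rng.random() < 0.5 for _ in range(400)]

p_noise = permutation_p_value(labels, noise_preds)   # predictions unrelated to labels
p_perfect = permutation_p_value(labels, labels)      # predictions identical to labels
```

A p-value near the reported 0.135 or 0.515 means that shuffled labels match the model's predictions about as often as the real labels do, which is what a model without predictive edge would produce.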

What carries the argument

Walk-forward expanding-window validation applied to binary targets derived from the 10:30 AM open to close price move in five-minute bars, serving as the testbed for comparing LSTM sequence processing against gradient boosting on raw OHLCV features.
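The testbed described above can be sketched in a few lines. Everything below is illustrative: the bar tuple layout, the reading of "10:30 AM open" as the open of the bar that starts at 10:30, and the equal-sized block boundaries are assumptions made here, since the summary does not specify them.

```python
from datetime import time

def session_target(bars, anchor=time(10, 30), threshold=10.0):
    """bars: time-ordered list of (bar_start, open, high, low, close, volume)
    for one session. Returns 1 if the final close exceeds the open of the
    anchor bar by more than `threshold` points, 0 otherwise, and None when
    the anchor bar is missing."""
    anchor_open = next((b[1] for b in bars if b[0] == anchor), None)
    if anchor_open is None:
        return None
    return int(bars[-1][4] - anchor_open > threshold)

def expanding_window_folds(n_days, n_folds=3):
    """Yield (train_days, test_days) index ranges. The training window always
    starts at day 0 and grows; each test block lies strictly after it."""
    block = n_days // (n_folds + 1)  # assumed: equal-sized blocks
    for k in range(1, n_folds + 1):
        train_end = k * block
        test_end = n_days if k == n_folds else (k + 1) * block
        yield range(0, train_end), range(train_end, test_end)

# Hypothetical session: anchor open 100.0, final close 111.0.
demo_bars = [
    (time(10, 25), 99.0, 100.5, 98.5, 100.0, 120),
    (time(10, 30), 100.0, 101.0, 99.0, 100.5, 90),
    (time(15, 55), 110.0, 112.0, 109.0, 111.0, 75),
]
demo_label = session_target(demo_bars)

# 944 trading days and three out-of-sample periods, as in the abstract.
folds = list(expanding_window_folds(944))
```

Because every test range starts where its training range ends, no future session can leak into training, which is the property the walk-forward protocol is meant to guarantee.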

If this is right

Single-instrument five-minute datasets spanning four years do not suffice for above-chance directional forecasts.
No tested architecture exceeds the base rate of 51.8 percent; out-of-sample accuracies fall between 50.00 and 50.89 percent.
Unstable feature importances across folds indicate the absence of persistent sequential signals.
This evaluation provides a documented lower bound for data scale in Kronos-inspired financial sequence models.
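One way to quantify the feature-importance instability mentioned in the list above is a rank correlation between the importance vectors of two folds. The sketch below uses Spearman correlation in plain Python; the choice of metric and the importance values are assumptions made here, and the summary does not say how the paper measured instability.

```python
def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    """Pearson correlation of the ranks of a and b.
    Undefined (division by zero) if either input is constant."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical importances for the same four features in two folds.
fold_1 = [0.40, 0.30, 0.20, 0.10]
fold_2 = [0.10, 0.20, 0.30, 0.40]

stable = spearman(fold_1, fold_1)    # identical ordering
unstable = spearman(fold_1, fold_2)  # fully reversed ordering
```

Correlations near 1 across consecutive folds would point to a persistent signal, while values near zero or below are what noise fitting would tend to produce.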

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the same protocol to multi-year histories of several correlated futures might reveal cross-asset structure.
Alternative target definitions, such as different intraday reference times, could surface signals missed by the fixed 10:30 AM window.
Researchers may need to move beyond single-contract OHLCV to include order-book or macroeconomic features to exceed the observed performance ceiling.

Load-bearing premise

The assumption that the specific binary target and five-minute OHLCV representation would detect exploitable sequential structure if any existed in the market.

What would settle it

Finding statistically significant accuracy above 52 percent on the same target using a dataset at least twice as long or drawn from multiple instruments would falsify the claim that four years of single-instrument data are insufficient.

Original abstract

This paper compares gradient boosting and long short-term memory (LSTM) architectures for intraday directional prediction in Micro E-Mini Nasdaq 100 futures (MNQ). Motivated by recent foundation-model research on financial candlestick data, including the Kronos architecture, we test whether five-minute OHLCV bar sequences contain exploitable sequential predictive structure at the scale of a single instrument dataset. Using 944 trading days from 2021-2025, four model configurations are evaluated under strict expanding-window walk-forward validation across three out-of-sample periods. The target variable is whether the session close exceeds the 10:30 AM open by more than ten points. No configuration produces statistically significant out-of-sample accuracy above the 51.8% base rate. Combined OOS accuracies range from 50.00% to 50.89% across gradient boosting variants, while the LSTM achieves 50.59%. Permutation tests yield p-values of 0.135 for the best gradient boosting model and 0.515 for the LSTM, indicating no statistically significant predictive edge. Feature importance instability across walk-forward folds suggests noise fitting rather than stable structural signal capture. The results indicate that four years of single-instrument five-minute OHLCV data are insufficient for reliable sequential ML-based intraday forecasting. The primary contribution is a documented evaluation of a Kronos-inspired architecture on a constrained real-world dataset, providing an empirical lower bound on data scale requirements for sequential financial ML.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript compares gradient boosting and LSTM models for predicting whether the MNQ futures session close exceeds the 10:30 AM open by more than 10 points, using five-minute OHLCV sequences over 944 trading days (2021-2025). It applies expanding-window walk-forward validation across three out-of-sample periods, reports combined OOS accuracies of 50.00-50.89% for GB variants and 50.59% for LSTM, and finds permutation p-values of 0.135 and 0.515 against the 51.8% base rate. Feature-importance instability is cited as evidence of noise fitting. The author concludes that four years of single-instrument data are insufficient for reliable sequential ML intraday forecasting and positions the work as an empirical lower bound motivated by Kronos-style architectures.

Significance. If the negative result is robust, the paper supplies a useful empirical lower bound on data scale for sequential models in single-instrument intraday futures settings. The expanding-window walk-forward protocol, the permutation tests for significance, and the reporting of feature-importance instability across folds deserve explicit credit. These elements strengthen the contribution by documenting practical limits rather than claiming positive predictive power.

major comments (1)

[Abstract and target variable definition] The central claim that four years of single-instrument five-minute OHLCV data are insufficient for reliable sequential ML-based intraday forecasting (Abstract) is load-bearing on the assumption that the tested binary target would have revealed exploitable sequential structure if present. The target is defined as a single long-horizon outcome (close > 10:30 AM open by >10 points) anchored at a fixed intraday time. This formulation may miss shorter-horizon dependencies or alternative sequential patterns (e.g., volatility clustering or order-flow proxies) that the OHLCV sequence could contain, rendering the null results (accuracies near base rate, non-significant p-values) consistent with either absent structure or target-label misalignment.

minor comments (2)

[Methods] The abstract and methods lack explicit detail on the hyperparameter search space, optimization procedure, and exact data-cleaning rules (e.g., handling of overnight gaps or zero-volume bars). Adding these would improve reproducibility.
[Results] Consider adding a table that reports per-period accuracies, permutation p-values, and base rates to make the cross-period stability assessment more transparent.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below.

Referee: [Abstract and target variable definition] The central claim that four years of single-instrument five-minute OHLCV data are insufficient for reliable sequential ML-based intraday forecasting (Abstract) is load-bearing on the assumption that the tested binary target would have revealed exploitable sequential structure if present. The target is defined as a single long-horizon outcome (close > 10:30 AM open by >10 points) anchored at a fixed intraday time. This formulation may miss shorter-horizon dependencies or alternative sequential patterns (e.g., volatility clustering or order-flow proxies) that the OHLCV sequence could contain, rendering the null results (accuracies near base rate, non-significant p-values) consistent with either absent structure or target-label misalignment.

Authors: We thank the referee for highlighting this consideration. The binary target was deliberately selected as a specific, economically relevant test case: whether the session close exceeds the 10:30 AM open by more than 10 points. This formulation corresponds to a clear directional signal that could inform intraday trading strategies and is consistent with the types of sequential forecasts explored in Kronos-style architectures. While alternative targets (shorter-horizon returns, volatility clustering, or order-flow proxies) might capture different dependencies, our study evaluates whether exploitable structure exists for this representative long-horizon directional task under realistic walk-forward conditions. The non-significant results therefore supply an empirical lower bound for this class of prediction problems rather than a universal claim that no sequential information exists in the OHLCV series for any possible label. We will revise the manuscript to expand the discussion of target selection, explicitly note this scope limitation, and clarify that other formulations remain open for future work.

Revision: yes

Circularity Check

0 steps flagged

Empirical evaluation relies on external statistical benchmarks with no reduction to fitted inputs or self-citations

The paper reports an empirical comparison of LSTM and gradient boosting models for a fixed binary target (close exceeding 10:30 AM open by more than ten points) on five-minute OHLCV sequences, using expanding-window walk-forward validation, out-of-sample accuracy measurement, and permutation tests against the 51.8% base rate. The conclusion of data insufficiency follows directly from observed accuracies near the base rate and non-significant p-values (0.135 and 0.515), without any claimed derivation, equation, or parameter fit that reduces to the target or inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the evaluation uses independent statistical procedures rather than renaming or smuggling ansatzes. This matches the default expectation of a self-contained empirical study.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the modeling choice that five-minute OHLCV bars plus standard LSTM and gradient-boosting architectures are adequate probes for sequential structure, plus the implicit assumption that the 10-point threshold defines a meaningful economic target.

free parameters (1)

10-point threshold
Arbitrary cutoff chosen to define the binary target; its value directly affects base rate and model difficulty.
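The dependence of the base rate on the threshold can be illustrated with a short sketch. The move values below are invented for illustration and are not drawn from the paper; the function computes the positive-class share, and if the paper's 51.8% figure is a majority-class share, the corresponding quantity would be the larger of that share and its complement.

```python
def base_rate(moves, threshold):
    """Share of sessions whose anchor-to-close move exceeds `threshold` points."""
    return sum(m > threshold for m in moves) / len(moves)

# Hypothetical 10:30-open-to-close moves in index points (not the paper's data).
moves = [-35.0, -12.5, -4.0, 3.0, 8.0, 11.0, 15.5, 22.0, 40.0, 61.0]

# Positive-class share for several candidate thresholds.
rates = {t: base_rate(moves, t) for t in (0, 10, 20, 50)}
```

On this toy sample the positive share falls from 0.7 at a threshold of 0 to 0.1 at a threshold of 50, which shows why the cutoff has to be reported alongside any accuracy figure.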

axioms (1)

domain assumption: Five-minute OHLCV sequences contain any exploitable sequential predictive information that may exist.
The entire experimental design presupposes that this data representation is rich enough to surface structure if present.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

No configuration produces statistically significant out-of-sample accuracy above the 51.8% base rate... The results indicate that four years of single-instrument five-minute OHLCV data are insufficient for reliable sequential ML-based intraday forecasting.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1]

Introduction The publication of foundation models for financial time series data represents a meaningful development in quantitative research. The Kronos model (Shi et al., 2025), trained on millions of candlestick bars across multiple instruments and time periods, demonstrated that sequential bar-level structure—the relationship between successive OHLCV ...

[2]

After session boundary filtering and removal of partial days, 947 complete trading days remain for daily feature construction

Data and Feature Engineering 2.1 Data The primary dataset consists of 72,604 five-minute OHLCV bars for MNQ continuous front-month futures covering regular trading hours (09:30–16:00 ET) from December 2021 through September 2025. After session boundary filtering and removal of partial days, 947 complete trading days remain for daily feature construction. ...

[3]

Target A (daily close vs

Target Variable Construction Two candidate target variables were evaluated before model training. Target A (daily close vs. open) and Target B (first-hour direction survival) were assessed for sample balance and feasibility. Target B was rejected due to a 6.95% base rate, which reflects the fact that a 2-point first-hour direction survival threshold is to...

[4]

Model Architectures and Validation Methodology 4.1 Gradient Boosting Classifier Three gradient boosting configurations are evaluated, each representing a different feature set and target specification. All use HistGradientBoostingClassifier from scikit-learn with conservative regularization to limit overfitting on the small dataset: max_leaf_nodes = 15, m...

[5]

This is the most conservative test—it does not attempt to use intraday structure and instead relies on multi-day patterns in returns, gaps, and volatility

Results 5.1 Gradient Boosting — Daily Features (GB-Daily) The daily feature model tests whether aggregated daily OHLCV statistics contain predictive information for next-session direction. This is the most conservative test—it does not attempt to use intraday structure and instead relies on multi-day patterns in returns, gaps, and volatility. Fold Train T...

[6]

Structural Interpretation 6.1 Sample Size Requirements for Sequential ML in Finance The Kronos model was trained on approximately 9.2 million candlestick bars across multiple instruments and time periods. Our dataset contains 72,604 five-minute RTH bars for a single instrument over four years, approximately 127 times fewer bars than the ...

[7]

A 16-unit single-layer LSTM is the simplest possible sequential model

Limitations and Extensions 7.1 Limitations The LSTM architecture tested here is intentionally minimal. A 16-unit single-layer LSTM is the simplest possible sequential model. More complex architectures—deeper LSTMs, bidirectional LSTMs, attention mechanisms, or transformer layers—might extract different information from the same sequences. However, any inc...

[8]

Kronos: A foundation model for the language of financial markets. arXiv preprint arXiv:2508.02739, 2025

Conclusion This paper has documented a systematic evaluation of gradient boosting and LSTM architectures for intraday directional prediction in MNQ futures. No configuration produced statistically significant out-of-sample accuracy. Combined OOS accuracies range from 50.00% to 50.89% across gradient boosting variants and 50.59% for the ...

