pith. machine review for the scientific record.

arxiv: 2604.24486 · v1 · submitted 2026-04-27 · 💻 cs.CE

Recognition: unknown

Comparative Evaluation of Modern Deep Learning Methodologies for Portfolio Optimization


Pith reviewed 2026-05-07 17:25 UTC · model grok-4.3

classification 💻 cs.CE
keywords portfolio optimization, graph neural networks, transformers, deep reinforcement learning, autoencoders, mean-variance optimization, backtesting, risk-adjusted returns

The pith

Transformer plus Graph Neural Network hybrids deliver the lowest volatility and drawdowns in portfolio optimization, while mean-variance optimization with well-calibrated inputs yields the highest returns and Sharpe ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates deep learning models for portfolio optimization by testing Graph Neural Networks, Transformers, Deep Reinforcement Learning, Autoencoders, and their combinations against traditional benchmarks. It applies these models to 2015-2023 data across equities, ETFs, and bonds, using them for covariance estimation, return prediction, and asset allocation, and evaluates the resulting strategies through backtesting. Performance is measured by volatility, cumulative return, maximum drawdown, annualized return, and Sharpe ratio. Hybrid models show stronger stability and risk control than standalone approaches, yet classical mean-variance optimization still leads on return metrics when supplied with good forecasts from the deep models. The work matters because it tests whether recent neural architectures can produce more reliable investment strategies in practice.
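The five metrics listed here have standard textbook definitions; a minimal NumPy sketch of how they are computed from a series of periodic portfolio returns (function and parameter names are ours, not the paper's):

```python
import numpy as np

def performance_metrics(returns, periods_per_year=252, rf=0.0):
    """Standard backtest metrics from periodic portfolio returns.
    Conventions (ddof=1, geometric annualization) are common choices,
    not details taken from the paper."""
    returns = np.asarray(returns, dtype=float)
    cumulative = np.cumprod(1.0 + returns)          # growth of $1
    n = len(returns)
    ann_return = cumulative[-1] ** (periods_per_year / n) - 1.0
    volatility = returns.std(ddof=1) * np.sqrt(periods_per_year)
    running_peak = np.maximum.accumulate(cumulative)
    max_drawdown = ((cumulative - running_peak) / running_peak).min()
    excess = returns - rf / periods_per_year
    sharpe = excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)
    return {
        "cumulative_return": cumulative[-1] - 1.0,
        "annualized_return": ann_return,
        "volatility": volatility,
        "max_drawdown": max_drawdown,
        "sharpe": sharpe,
    }
```

Any ranking claim in the paper (lowest volatility, highest Sharpe) is a comparison of these numbers across the seven strategies.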

Core claim

Hybrid models such as Transformer combined with Graph Neural Network achieve the lowest volatility and maximum drawdown, providing superior stability and risk control. Mean-Variance Optimization paired with well-calibrated inputs from these models produces the highest cumulative return and Sharpe ratio. Standalone Deep Reinforcement Learning underperforms due to limited awareness of market structure, while Autoencoders perform similarly to simple equal-weighted portfolios, underscoring the value of dynamic policy learning over static feature compression.

What carries the argument

A comparative backtesting framework that combines deep learning components for relational modeling, temporal forecasting, and dimensionality reduction with classical Mean-Variance Optimization to produce dynamic asset allocations.
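The MVO stage of such a pipeline consumes forecast means and covariances, wherever they come from. A minimal sketch using the closed-form unconstrained solution w proportional to inverse(Sigma) times mu (the risk-aversion value and the normalization are illustrative choices, not the paper's calibration):

```python
import numpy as np

def mvo_weights(mu, sigma, risk_aversion=5.0):
    """Unconstrained mean-variance weights: solve Sigma w = mu, scale
    by 1/gamma, normalize to sum to one. In the paper's pipeline mu and
    sigma would be supplied by the deep models; here they are inputs.
    Long-only constraints are deliberately not enforced in this sketch."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    w = np.linalg.solve(sigma, mu) / risk_aversion
    return w / w.sum()
```

Swapping a sample covariance for a GNN-implied one, or sample means for Transformer forecasts, changes only the inputs to this function, which is why the review can attribute MVO's strong results to "well-calibrated inputs".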

If this is right

  • Hybrid architectures that jointly model relational and temporal market structures improve risk-adjusted stability over single-model approaches.
  • Traditional mean-variance optimization retains strong performance when supplied with deep learning forecasts for returns and covariances.
  • Deep reinforcement learning requires explicit structural components to avoid underperformance relative to simpler baselines.
  • Autoencoders alone tend to replicate equal-weighted behavior, indicating limited added value without dynamic allocation layers.
  • Integrating deep learning outputs into classical optimization frameworks produces more robust strategies than either category alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on latent representations could be extended to test whether learned embeddings improve scalability when the number of assets grows beyond the current set.
  • Live deployment would likely require online retraining to handle regime shifts not captured in the fixed 2015-2023 window.
  • Adding turnover penalties during optimization might shift preference toward lower-frequency rebalancing policies even if they sacrifice some Sharpe ratio.
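The turnover-penalty idea in the last point can be sketched as soft-thresholding of the trade vector, which is the proximal operator of an L1 turnover penalty (a standard result; the threshold value and function name below are illustrative, not from the paper):

```python
import numpy as np

def throttle_turnover(w_target, w_prev, threshold):
    """Apply an L1 turnover penalty via soft-thresholding: trades
    smaller than `threshold` are skipped, larger trades are shrunk
    toward the previous weights by `threshold`. Re-normalizes so
    weights still sum to one."""
    trade = np.asarray(w_target, float) - np.asarray(w_prev, float)
    shrunk = np.sign(trade) * np.maximum(np.abs(trade) - threshold, 0.0)
    w = np.asarray(w_prev, float) + shrunk
    return w / w.sum()
```

With a large enough threshold the portfolio simply stops trading, which is the mechanism behind the editorial point: penalized policies drift toward lower-frequency rebalancing even at some cost in Sharpe ratio.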

Load-bearing premise

Results from the 2015-2023 backtest period and selected performance metrics will continue to hold for future market conditions without major overfitting or unmodeled costs such as transaction fees and liquidity limits.

What would settle it

Re-running the identical seven strategies on market data after 2023 or with added transaction costs and liquidity constraints, then checking whether the Transformer+GNN model still shows the lowest volatility and drawdown while MVO still leads in cumulative return.
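One concrete way to run the transaction-cost variant of this check is to charge a proportional fee per unit of turnover against the gross backtest returns (the 10 bps figure below is an illustrative cost level, not one from the paper):

```python
import numpy as np

def net_returns(gross_returns, weights, cost_bps=10.0):
    """Subtract proportional transaction costs from gross period
    returns. `weights` is a (T, N) array of post-rebalance weights;
    per-period turnover is the L1 change in weights, with the first
    period charged the full initial buy-in."""
    weights = np.asarray(weights, dtype=float)
    turnover = np.abs(np.diff(weights, axis=0)).sum(axis=1)
    turnover = np.concatenate([[np.abs(weights[0]).sum()], turnover])
    return np.asarray(gross_returns, dtype=float) - turnover * cost_bps / 1e4
```

If the Transformer+GNN ranking survives this adjustment while a high-turnover DRL policy does not, the review's load-bearing premise about unmodeled costs is directly tested.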

Original abstract

This study proposes a portfolio optimization framework that integrates advanced deep learning architectures with traditional financial models to enhance risk-adjusted performance. Using historical data from 2015-2023 across equities, ETFs, and bonds, the research evaluates the predictive power of Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), Transformers, and Autoencoders. The models jointly address covariance estimation, return forecasting, dynamic asset allocation, and dimensionality reduction. Hybrid approaches such as Transformer+GNN and Autoencoder+DRL are also explored to capture both relational and temporal market structures. Performance is assessed through backtesting using metrics including volatility, cumulative return, maximum drawdown, annualized return, and Sharpe ratio across seven strategies, including Equal-Weighted, 60/40 allocation, and Mean-Variance Optimization (MVO). Results show that hybrid models provide superior stability and risk control, with Transformer+GNN achieving the lowest volatility and drawdown. MVO, when paired with well-calibrated inputs, delivers the highest cumulative return and Sharpe ratio, highlighting the continued relevance of traditional methods. Standalone DRL underperforms due to limited structural awareness, while Autoencoders exhibit behavior similar to Equal-Weight strategies, emphasizing the need for dynamic policy learning. These findings align with existing literature on relational modeling and feature compression in finance. Overall, the study demonstrates that combining deep learning with financial theory yields robust and adaptive portfolio strategies and suggests exploring latent representations within traditional optimization frameworks to improve scalability and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a portfolio optimization framework that combines deep learning models (GNNs, Transformers, DRL, Autoencoders) and hybrids with traditional methods like MVO. Using 2015-2023 backtests on equities, ETFs, and bonds, it evaluates performance on volatility, cumulative return, drawdown, and Sharpe ratio, claiming that Transformer+GNN hybrids deliver the best stability and risk control while MVO with calibrated inputs achieves the highest returns and Sharpe ratio.

Significance. If the results hold under proper validation, the work would provide a useful empirical comparison showing how relational (GNN) and temporal (Transformer) modeling can enhance portfolio stability when hybridized, while underscoring that classical MVO remains competitive with good forecasts. This could inform model selection in quantitative finance and encourage further integration of DL with mean-variance frameworks.

major comments (2)
  1. [Abstract and backtesting procedure] The central empirical claims rest on backtesting over the single 2015-2023 window, yet the abstract and available description supply no train/test split, walk-forward validation, purged cross-validation, or held-out period. Deep models have high capacity; without strict temporal separation, superior stability and return figures for Transformer+GNN and MVO can arise from fitting regime-specific noise rather than generalization. This directly undermines both the hybrid superiority claim and the assertion that MVO remains competitive once inputs are well-calibrated.
  2. [Methodology and Results sections] No architecture details (layer sizes, attention heads, GNN message-passing functions), training protocols, hyperparameter search procedures, or statistical significance tests are supplied. Without these, the reported metrics (e.g., Transformer+GNN lowest volatility/drawdown) cannot be reproduced or verified, rendering the comparative evaluation unverifiable.
minor comments (2)
  1. [Abstract] The abstract states that seven strategies are evaluated but does not list them explicitly; enumerating Equal-Weighted, 60/40, MVO, and the DL variants would improve clarity.
  2. [Abstract] The statement that findings 'align with existing literature on relational modeling and feature compression' would be strengthened by citing specific prior works.
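The walk-forward validation requested in major comment 1 can be sketched generically; window sizes and the purge gap below are illustrative, not prescriptions for this dataset:

```python
def walk_forward_splits(n_periods, train_size, test_size, purge=0):
    """Yield (train_indices, test_indices) pairs for walk-forward
    validation over a time series, with an optional purge gap between
    train and test to avoid label leakage. Windows roll forward by
    `test_size` each step; later periods never leak into earlier
    training sets."""
    start = 0
    while start + train_size + purge + test_size <= n_periods:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size + purge,
                          start + train_size + purge + test_size))
        yield train, test
        start += test_size
```

Reporting metrics averaged over these out-of-sample test windows, rather than over a single 2015-2023 fit, is the kind of evidence that would answer the referee's overfitting objection.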

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we have made or will make.

Point-by-point responses
  1. Referee: [Abstract and backtesting procedure] The central empirical claims rest on backtesting over the single 2015-2023 window, yet the abstract and available description supply no train/test split, walk-forward validation, purged cross-validation, or held-out period. Deep models have high capacity; without strict temporal separation, superior stability and return figures for Transformer+GNN and MVO can arise from fitting regime-specific noise rather than generalization. This directly undermines both the hybrid superiority claim and the assertion that MVO remains competitive once inputs are well-calibrated.

    Authors: We agree with the referee that the backtesting procedure must be described more rigorously to support the claims. We have revised the manuscript to include a full account of the train/test split, walk-forward validation, purged cross-validation, and held-out period used in our experiments. These additions directly address the potential for fitting to regime-specific noise and reinforce the validity of our findings on model performance. revision: yes

  2. Referee: [Methodology and Results sections] No architecture details (layer sizes, attention heads, GNN message-passing functions), training protocols, hyperparameter search procedures, or statistical significance tests are supplied. Without these, the reported metrics (e.g., Transformer+GNN lowest volatility/drawdown) cannot be reproduced or verified, rendering the comparative evaluation unverifiable.

    Authors: We acknowledge that the manuscript did not provide sufficient details on the model architectures, training protocols, hyperparameter searches, and statistical tests. This omission makes reproduction difficult. In the revised version, we have included all necessary details in the Methodology and Results sections, as well as in supplementary appendices, covering layer sizes, attention heads, GNN message-passing functions, training procedures, hyperparameter optimization methods, and statistical significance testing. These revisions make the comparative evaluation fully verifiable and reproducible. revision: yes

Circularity Check

0 steps flagged

Purely empirical backtest evaluation with no derivation chain

Full rationale

The paper performs a comparative empirical study of deep learning architectures (GNNs, Transformers, DRL, Autoencoders) and hybrids for portfolio optimization tasks including covariance estimation and asset allocation. All reported results—volatility, cumulative return, Sharpe ratio, drawdown—are computed directly from backtests on the 2015-2023 historical dataset across the listed strategies. No first-principles derivation, predictive equations, or fitted-parameter renaming occurs; performance figures are observed outputs of the simulation rather than quantities that reduce to the model's own inputs by construction. Self-citations are absent from the provided text, and claims of alignment with literature refer to external work. The evaluation is therefore self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on numerous unfixed hyperparameters inside the neural architectures, the assumption that the 2015-2023 window is representative, and standard finance simplifications such as frictionless trading that are not explicitly validated.

free parameters (2)
  • Neural network hyperparameters (learning rates, layer sizes, attention heads)
    All deep learning models require extensive tuning to the specific dataset; these choices directly affect reported volatility and Sharpe ratios.
  • Covariance and return forecasting parameters
    Both traditional MVO and the DL variants fit these quantities to historical returns, making performance sensitive to the fitting window and regularization.
axioms (2)
  • domain assumption Historical market data from 2015-2023 is stationary enough for out-of-sample generalization
    All backtest conclusions presuppose that patterns observed in the training window persist.
  • domain assumption Transaction costs, slippage, and liquidity constraints can be ignored
    Standard in many academic backtests but materially affects real-world Sharpe ratios and drawdowns.

pith-pipeline@v0.9.0 · 5571 in / 1547 out tokens · 86232 ms · 2026-05-07T17:25:04.455930+00:00 · methodology

