pith. sign in

arxiv: 2605.17307 · v1 · pith:M2FIO6SCnew · submitted 2026-05-17 · 💱 q-fin.PM · cs.AI· cs.LG· cs.NE· q-fin.TR

Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets

Pith reviewed 2026-05-19 23:01 UTC · model grok-4.3

classification 💱 q-fin.PM cs.AIcs.LGcs.NEq-fin.TR
keywords deep reinforcement learningportfolio managementSoft Actor-Criticglobal equity marketswalk-forward optimizationrisk-adjusted returnstransaction costsdiversification
0
0 comments X p. Extension
pith:M2FIO6SC Add to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{M2FIO6SC}

Prints a linked pith:M2FIO6SC badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

Deep reinforcement learning for portfolio allocation shows competitive performance mainly in the Euro Stoxx 50 but delivers no statistically significant excess returns over buy-and-hold across global equity markets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a deep reinforcement learning framework that uses the Soft Actor-Critic algorithm to set continuous portfolio weights across major equity indices. It tests five model variants that differ in reward design, policy type, constraints, and sequence modeling, running them through sixteen walk-forward out-of-sample periods from 2003 to 2026 on the Nasdaq-100, Nikkei 225, and Euro Stoxx 50. The central aim is to determine whether these agents can produce better risk-adjusted results than a simple buy-and-hold approach once transaction costs and diversification rules are included. A reader would care because the results speak to whether automated, adaptive methods can add practical value in real markets that include trading frictions and regime shifts.

Core claim

The study finds that reinforcement learning strategies achieve competitive risk-adjusted performance primarily in the Euro Stoxx 50, where statistically significant abnormal returns are observed, yet the central hypothesis is only partially confirmed: no strategy achieves statistically significant excess returns relative to Buy and Hold under HAC-robust inference across all three markets. Regime analysis shows that the reinforcement learning approach adds the most value during periods of elevated uncertainty, while ensemble aggregation across markets improves risk-adjusted performance and confirms the benefits of geographic diversification.

What carries the argument

Soft Actor-Critic algorithm inside a Markov Decision Process that learns continuous portfolio weights while embedding transaction costs, turnover penalties, and diversification constraints directly in the reward function.

Load-bearing premise

The sixteen walk-forward out-of-sample folds spanning 2003-2026 provide a sufficiently unbiased test of out-of-sample performance without the RL agent overfitting to the specific market regimes present in the training windows.

What would settle it

Re-running the identical walk-forward procedure on data after 2026 and finding no statistically significant abnormal returns in the Euro Stoxx 50 under the same HAC-robust tests would falsify the claim of competitive performance in that market.

Figures

Figures reproduced from arXiv: 2605.17307 by Kamil Kashif, Robert \'Slepaczuk.

Figure 3
Figure 3. Figure 3: Data construction pipeline for the reinforcement learning framework [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the reinforcement learning agent [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Decomposition of the Reinforcement Learning Reward Function [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Overview of the reinforcement learning portfolio allocation research methodology [PITH_FULL_IMAGE:figures/full_fig_p039_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Empirical results for the NASDAQ-100 across the three reinforcement learning configurations. [PITH_FULL_IMAGE:figures/full_fig_p041_8.png] view at source ↗
Figure 11
Figure 11. Figure 11: Equity curves for cross-asset ensemble portfolios across reinforcement learning strategies. [PITH_FULL_IMAGE:figures/full_fig_p056_11.png] view at source ↗
read the original abstract

This study develops and evaluates a deep reinforcement learning framework for dynamic portfolio allocation across global equity markets. The Soft Actor-Critic algorithm is used to learn continuous portfolio weights within a Markov Decision Process, incorporating transaction costs, turnover penalties, and diversification constraints into the reward function. Five model configurations are compared, varying in reward formulation, policy structure (flat versus hierarchical Dirichlet), portfolio constraints, and temporal encoder (LSTM versus Transformer), and evaluated via walk-forward optimization across sixteen out-of-sample folds spanning 2003-2026 on the Nasdaq-100, Nikkei 225, and Euro Stoxx 50. Results show that RL strategies achieve competitive risk-adjusted performance primarily in the Euro Stoxx 50, where statistically significant abnormal returns are observed, but the central hypothesis is only partially confirmed: no strategy achieves statistically significant excess returns relative to Buy and Hold under HAC-robust inference across all markets. Regime analysis reveals that RL adds the most value during periods of elevated uncertainty, while ensemble aggregation across markets improves risk-adjusted performance and confirms the benefits of geographic diversification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a deep reinforcement learning framework using the Soft Actor-Critic algorithm for dynamic portfolio allocation across the Nasdaq-100, Nikkei 225, and Euro Stoxx 50. It incorporates transaction costs, turnover penalties, and diversification constraints into the reward function and compares five model configurations differing in reward formulation, policy structure (flat vs. hierarchical Dirichlet), constraints, and temporal encoder (LSTM vs. Transformer). Evaluation uses walk-forward optimization over sixteen out-of-sample folds spanning 2003-2026. The central results are that RL strategies achieve competitive risk-adjusted performance primarily in the Euro Stoxx 50 (with statistically significant abnormal returns) but the hypothesis is only partially confirmed: no configuration produces statistically significant excess returns over Buy-and-Hold under HAC-robust inference across all three markets. Additional findings include greater value added during elevated-uncertainty regimes and improved performance from ensemble aggregation across markets.

Significance. If the out-of-sample claims hold after addressing the noted gaps, the work would provide useful evidence on the regime-dependent utility of RL for global equity allocation and the benefits of geographic diversification. The walk-forward design and HAC-robust inference are positive features for robustness claims, though the partial confirmation across markets tempers the overall impact.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Methodology): The manuscript reports statistically significant abnormal returns in the Euro Stoxx 50 yet supplies no explicit equations or pseudocode for the reward function (including the precise weighting of transaction costs, turnover penalties, and diversification constraints), the hyperparameter search procedure, or the exact implementation of HAC-robust standard errors. These omissions are load-bearing because they prevent verification that the reported alphas are not artifacts of the training objective or inference choices.
  2. [§5 and §4.3] §5 (Empirical Results) and §4.3 (Walk-forward procedure): The central out-of-sample performance claims rest on sixteen walk-forward folds being an unbiased test of generalization. The paper does not demonstrate that the Soft Actor-Critic agent (with LSTM or Transformer encoders) avoids implicit memorization of recurring volatility or correlation regimes that may span multiple training windows. Because the reward penalties modulate policy within the same regime distribution rather than breaking temporal dependence, the reported regime-specific value-add and alphas could reflect in-sample regime capture rather than true generalization.
minor comments (2)
  1. [Table 2] Table 2 or equivalent performance summary: the R² or information-ratio values for the Buy-and-Hold benchmark should be reported alongside the RL configurations to allow direct comparison of economic magnitude.
  2. [Figure 4] Figure 4 (regime analysis): axis labels and shading for 'elevated uncertainty' periods could be clarified to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving methodological transparency and strengthening the robustness claims. We address each major comment below and indicate the revisions we will make to the next version of the paper.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Methodology): The manuscript reports statistically significant abnormal returns in the Euro Stoxx 50 yet supplies no explicit equations or pseudocode for the reward function (including the precise weighting of transaction costs, turnover penalties, and diversification constraints), the hyperparameter search procedure, or the exact implementation of HAC-robust standard errors. These omissions are load-bearing because they prevent verification that the reported alphas are not artifacts of the training objective or inference choices.

    Authors: We agree that explicit formulations are necessary for full reproducibility and independent verification of the reported alphas. In the revised manuscript we will insert the complete reward function equation in §4, with explicit coefficients for transaction costs, turnover penalties, and the diversification constraint term. We will also add a dedicated subsection describing the hyperparameter search (including the search space, optimization method, and selection criterion) and specify the HAC implementation details, including the kernel choice and lag selection rule. These additions will be placed in §4 to allow readers to confirm that the alphas are not artifacts of the training objective. revision: yes

  2. Referee: [§5 and §4.3] §5 (Empirical Results) and §4.3 (Walk-forward procedure): The central out-of-sample performance claims rest on sixteen walk-forward folds being an unbiased test of generalization. The paper does not demonstrate that the Soft Actor-Critic agent (with LSTM or Transformer encoders) avoids implicit memorization of recurring volatility or correlation regimes that may span multiple training windows. Because the reward penalties modulate policy within the same regime distribution rather than breaking temporal dependence, the reported regime-specific value-add and alphas could reflect in-sample regime capture rather than true generalization.

    Authors: We acknowledge the referee’s concern that regime overlap across successive training windows could allow implicit memorization. The walk-forward design with sixteen non-overlapping out-of-sample folds spanning 2003–2026 already exposes each model to multiple distinct market regimes, including crises and low-volatility periods. The temporal encoders (LSTM and Transformer) are intended to model evolving dynamics rather than static regime patterns, and the regime-stratified results show that value added is concentrated in high-uncertainty periods that are not uniformly distributed across folds. Nevertheless, we agree that additional safeguards would strengthen the generalization claim. In revision we will expand §4.3 with a discussion of regime coverage across folds and add a sensitivity check that reports performance when training windows are shortened or when folds are reordered. We will also clarify that the reward penalties operate on realized turnover and diversification within each period rather than on regime labels. revision: partial

Circularity Check

0 steps flagged

No circularity: out-of-sample walk-forward results are independent of training objective

full rationale

The paper reports empirical risk-adjusted performance and HAC-robust statistical comparisons to Buy-and-Hold on sixteen walk-forward out-of-sample folds spanning 2003-2026. These evaluation metrics are computed on held-out periods after the Soft Actor-Critic agent is trained on preceding windows and are not defined as or reduced to the reward function components (transaction costs, turnover penalties, diversification constraints). No equations, fitted parameters, or self-citations are shown that would make the reported alphas or regime-specific value-add equivalent to the training inputs by construction. The methodology therefore remains self-contained against external market data and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard MDP assumptions for financial time series and on the premise that the chosen reward components (returns, costs, turnover, diversification) adequately proxy real-world investor utility; no new entities are postulated.

axioms (1)
  • domain assumption Financial market returns can be modeled as a Markov Decision Process with observable state features sufficient for policy learning.
    The abstract frames the problem as an MDP and trains policies on historical price and index data.

pith-pipeline@v0.9.0 · 5725 in / 1338 out tokens · 34124 ms · 2026-05-19T23:01:11.162588+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 2 internal anchors

  1. [1]

    and Consoli, Sergio and Piras, Luca and Podda, Alessandro Sebastian and Recupero, Diego Reforgiato , title =

    Carta, Salvatore M. and Consoli, Sergio and Piras, Luca and Podda, Alessandro Sebastian and Recupero, Diego Reforgiato , title =. IEEE Access , volume =. 2021 , publisher =

  2. [2]

    Sensors , volume =

    Micha. Sensors , volume =. 2022 , publisher =

  3. [3]

    Applying

    Bui, Quynh and. Applying. Physica A: Statistical Mechanics and its Applications , volume =. 2022 , publisher =

  4. [4]

    The Journal of Finance , volume =

    Markowitz, Harry , title =. The Journal of Finance , volume =. 1952 , publisher =

  5. [5]

    Financial Analysts Journal , volume =

    Black, Fischer and Litterman, Robert , title =. Financial Analysts Journal , volume =. 1992 , publisher =

  6. [6]

    Journal of Big Data , volume =

    Optimal. Journal of Big Data , volume =. 2025 , publisher =

  7. [7]

    Neurocomputing , volume =

    Kim, Kyoung-jae , title =. Neurocomputing , volume =. 2003 , publisher =

  8. [8]

    Predicting the direction of stock market prices using random forest

    Khaidem, Luckyson and Saha, Snehanshu and Dey, Sudeepa Roy , title =. arXiv preprint arXiv:1605.00003 , year =. 1605.00003 , archivePrefix =

  9. [9]

    Application of machine learning in algorithmic investment strategies on global stock markets , journal =

    Grudniewicz, Jan and. Application of machine learning in algorithmic investment strategies on global stock markets , journal =. 2023 , publisher =

  10. [10]

    and Borwein, Jonathan M

    Bailey, David H. and Borwein, Jonathan M. and. The probability of backtest overfitting , journal =. 2017 , publisher =

  11. [11]

    2018 , publisher =

    The 10 reasons most machine learning funds fail , journal =. 2018 , publisher =

  12. [12]

    IEEE Transactions on Emerging Topics in Computational Intelligence , volume =

    Lin, Yu-Fei and Huang, Tzu-Ming and Chung, Wei-Ho and Ueng, Yeong-Luh , title =. IEEE Transactions on Emerging Topics in Computational Intelligence , volume =. 2021 , publisher =

  13. [13]

    Supervised autoencoder

    Bieganowski, Bartosz and. Supervised autoencoder. Journal of Big Data , volume =. 2025 , publisher =

  14. [14]

    European Journal of Operational Research , volume =

    Fischer, Thomas and Krauss, Christopher , title =. European Journal of Operational Research , volume =. 2018 , publisher =

  15. [15]

    Daily and intraday application of various architectures of the

    Krynska, Katarzyna and. Daily and intraday application of various architectures of the. SSRN Electronic Journal , year =

  16. [16]

    Knowledge-Based Systems , volume =

    Kashif, Kamil and. Knowledge-Based Systems , volume =. 2025 , publisher =

  17. [17]

    Informer in algorithmic investment strategies on high frequency bitcoin data , journal =

    Stefaniuk, Filip and. Informer in algorithmic investment strategies on high frequency bitcoin data , journal =. 2025 , eprint =

  18. [18]

    Mathematical Finance , volume =

    Hambly, Ben and Xu, Renyuan and Yang, Huining , title =. Mathematical Finance , volume =. 2023 , publisher =

  19. [19]

    IEEE Transactions on Neural Networks , volume =

    Moody, John and Saffell, Matthew , title =. IEEE Transactions on Neural Networks , volume =. 2001 , publisher =

  20. [20]

    IEEE Transactions on Neural Networks and Learning Systems , volume =

    Deng, Yue and Bao, Feng and Kong, Youyong and Ren, Zhiquan and Dai, Qionghai , title =. IEEE Transactions on Neural Networks and Learning Systems , volume =. 2017 , publisher =

  21. [21]

    Application of Deep Reinforcement Learning to At-the-Money

    Bracha, Zofia and Sakowski, Pawe. Application of Deep Reinforcement Learning to At-the-Money. arXiv preprint arXiv:2510.09247 , year =. 2510.09247 , archivePrefix =

  22. [22]

    Symmetry , volume =

    Zhang, Haoran and Li, Xiaofei and Wan, Tianjiao and Du, Junjie , title =. Symmetry , volume =. 2026 , publisher =

  23. [23]

    arXiv preprint arXiv:2112.06753 , year =

    Liu, Xiao-Yang and Rui, Jingyang and Gao, Jiechao and Yang, Liuqing and Yang, Hongyang and Wang, Zhaoran and Wang, Christina Dan and Guo, Jian , title =. arXiv preprint arXiv:2112.06753 , year =. 2112.06753 , archivePrefix =

  24. [24]

    Quantitative Finance , volume =

    Buehler, Hans and Gonon, Lukas and Teichmann, Josef and Wood, Ben , title =. Quantitative Finance , volume =. 2019 , publisher =

  25. [25]

    Computational Management Science , volume =

    Maringer, Dietmar and Ramtohul, Tikesh , title =. Computational Management Science , volume =. 2012 , publisher =

  26. [26]

    and Ritter, Gordon and Wang, Yixuan and Zhang, Bofei , title =

    Du, Jiayi and Jin, Muyang and Kolm, Petter N. and Ritter, Gordon and Wang, Yixuan and Zhang, Bofei , title =. The Journal of Financial Data Science , volume =. 2020 , publisher =

  27. [27]

    IEEE Access , volume =

    Kabbani, Taylan and Duman, Ekrem , title =. IEEE Access , volume =. 2022 , publisher =

  28. [28]

    2025 International Conference on Sustainability, Innovation & Technology (ICSIT) , pages =

    Rani, Ishta and Gandhi, Hina and Kumar, Ramesh and Marannan, Nithya and Kim, Na Kyung and Kumar, Tejaswini , title =. 2025 International Conference on Sustainability, Innovation & Technology (ICSIT) , pages =. 2025 , publisher =

  29. [29]

    Proceedings of the First ACM International Conference on AI in Finance (ICAIF '20) , pages =

    Yang, Hongyang and Liu, Xiao-Yang and Zhong, Shan and Walid, Anwar , title =. Proceedings of the First ACM International Conference on AI in Finance (ICAIF '20) , pages =. 2020 , publisher =

  30. [30]

    IEEE Transactions on Big Data , volume =

    Enkhsaikhan, Bayaraa and Jo, Ohyun , title =. IEEE Transactions on Big Data , volume =. 2025 , publisher =

  31. [31]

    2024 IEEE 5th India Council International Subsections Conference (INDISCON) , pages =

    Tamuly, Adrika and Bhutani, Gariman and Sukriti , title =. 2024 IEEE 5th India Council International Subsections Conference (INDISCON) , pages =. 2024 , publisher =

  32. [32]

    Expert Systems with Applications , volume =

    Soleymani, Farzan and Paquet, Eric , title =. Expert Systems with Applications , volume =. 2020 , publisher =

  33. [33]

    Global Finance Journal , volume =

    Jiang, Yifu and Olmo, Jose and Atwi, Majed , title =. Global Finance Journal , volume =. 2024 , publisher =

  34. [34]

    and Thorne, Marcus V

    Sterling, Helena J. and Thorne, Marcus V. , title =. International Journal of Artificial Intelligence Research , volume =. 2026 , publisher =

  35. [35]

    Neurocomputing , volume =

    Cheng, Li-Chen and Sun, Jian-Shiou , title =. Neurocomputing , volume =. 2024 , publisher =

  36. [36]

    Analytics , volume =

    Millea, Adrian , title =. Analytics , volume =. 2023 , publisher =

  37. [37]

    Journal of Risk and Financial Management , volume =

    Hao, Zheng and Zhang, Haowei and Zhang, Yipu , title =. Journal of Risk and Financial Management , volume =. 2023 , publisher =

  38. [38]

    Expert Systems with Applications , volume =

    Shavandi, Ali and Khedmati, Majid , title =. Expert Systems with Applications , volume =. 2022 , publisher =

  39. [39]

    Information Sciences , volume =

    Wu, Xing and Chen, Haolei and Wang, Jianjia and Troiano, Luigi and Loia, Vincenzo and Fujita, Hamido , title =. Information Sciences , volume =. 2020 , publisher =

  40. [40]

    and Veness, Joel and Bellemare, Marc G

    Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A. and Veness, Joel and Bellemare, Marc G. and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K. and Ostrovski, Georg and others , title =. Nature , volume =. 2015 , publisher =

  41. [41]

    Proximal Policy Optimization Algorithms

    Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg , title =. arXiv preprint arXiv:1707.06347 , year =. 1707.06347 , archivePrefix =

  42. [42]

    Annals of Operations Research , volume =

    Enhancing. Annals of Operations Research , volume =. 2025 , publisher =

  43. [43]

    International Journal of Financial Studies , volume =

    Chaweewanchon, Apichat and Chaysiri, Rujira , title =. International Journal of Financial Studies , volume =. 2022 , publisher =

  44. [44]

    and West, Kenneth D

    Newey, Whitney K. and West, Kenneth D. , title =. Econometrica , volume =. 1987 , publisher =

  45. [45]

    and Romano, Joseph P

    Politis, Dimitris N. and Romano, Joseph P. , title =. Journal of the American Statistical Association , volume =. 1994 , publisher =

  46. [46]

    The Review of Financial Studies , volume =

    DeMiguel, Victor and Garlappi, Lorenzo and Uppal, Raman , title =. The Review of Financial Studies , volume =. 2009 , publisher =

  47. [47]

    and Bhadra, Dipayan and Ridoy, Moinul and Milanova, Mariofanna , title =

    Kabir, Md R. and Bhadra, Dipayan and Ridoy, Moinul and Milanova, Mariofanna , title =. Sci , volume =. 2025 , publisher =

  48. [48]

    Proceedings of the 35th International Conference on Machine Learning , series =

    Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey , title =. Proceedings of the 35th International Conference on Machine Learning , series =. 2018 , publisher =