TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

Alfonso Dufour; Atta Badii; Ilia Zaznov; Julian Kunkel

arXiv: 2606.08379 · v1 · pith:HJGDJ5W4new · submitted 2026-06-07 · 💻 cs.AI · cs.CE · cs.LG · q-fin.CP · q-fin.TR

Ilia Zaznov, Atta Badii, Julian Kunkel, Alfonso Dufour

Pith reviewed 2026-06-27 19:06 UTC · model grok-4.3

keywords: optimal trade execution · reinforcement learning · actor-critic · implementation shortfall · limit order book · trade impact · policy smoothing

The pith

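In simulation on limit-order-book data for ten U.S. stocks, TT-DAC-PS reduces mean implementation shortfall for large stock sell programs relative to reinforcement-learning and classical baselines.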

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a deterministic actor-critic model called TT-DAC-PS that combines twin critic targets, pessimistic min backup, target policy smoothing, delayed updates, and conservative Q regularization. It evaluates the model inside a simulator that merges Almgren-Chriss impact with real limit-order-book prices and volumes for ten U.S. stocks. A sympathetic reader would care because lower average shortfall directly reduces the cost of unwinding large positions while keeping variance competitive. The model is tested against PPO, SAC, A2C and the classical TWAP, VWAP, and Almgren-Chriss baselines.

Core claim

TT-DAC-PS integrates twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation. Exploration uses Ornstein-Uhlenbeck noise under a hybrid schedule of deterministic decay, variance-guided adjustment, and a learned temperature. When run on limit-order-book data for ten U.S. stocks, the method consistently lowers mean implementation shortfall percentage with competitive variance and outperforms the listed reinforcement-learning and classical benchmarks.
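The pessimistic min backup and target policy smoothing named in the core claim can be illustrated with a minimal NumPy sketch of a TD3-style target. This is not the authors' code: the function name `smoothed_min_target`, the toy critics, the action range [0, 1] (read here as a fraction of inventory to sell), and every hyperparameter value are assumptions made for illustration.

```python
import numpy as np

def smoothed_min_target(reward, next_state, done, actor_target,
                        critic1_target, critic2_target, rng,
                        gamma=0.99, sigma=0.2, noise_clip=0.5,
                        a_low=0.0, a_high=1.0):
    """TD3-style target: y = r + gamma * (1 - done) * min(Q1', Q2')(s', a~)."""
    a_next = actor_target(next_state)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = np.clip(rng.normal(0.0, sigma, size=a_next.shape),
                    -noise_clip, noise_clip)
    a_smooth = np.clip(a_next + noise, a_low, a_high)
    # Pessimistic backup: element-wise minimum of the two target critics.
    q_min = np.minimum(critic1_target(next_state, a_smooth),
                       critic2_target(next_state, a_smooth))
    return reward + gamma * (1.0 - done) * q_min

# Toy stand-ins for the target networks (hypothetical, for illustration only).
rng = np.random.default_rng(0)
actor = lambda s: np.full(s.shape[0], 0.5)
q1 = lambda s, a: 1.0 + a
q2 = lambda s, a: 2.0 - a

y = smoothed_min_target(np.array([0.1, 0.2]), np.zeros((2, 3)),
                        np.array([0.0, 1.0]), actor, q1, q2, rng)
```

For the terminal transition (`done = 1`) the target collapses to the reward, and for the non-terminal one it is bounded by the smaller of the two critic estimates, which is the mechanism intended to curb overestimation.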

What carries the argument

Twin-Target Deterministic Actor-Critic with Policy Smoothing (TT-DAC-PS), which stabilises Q-value estimates through twin targets, pessimistic backup, smoothing noise, and conservative regularisation to support better policy decisions in trade execution.

If this is right

Large sell programs can be completed at lower average cost than with time-weighted or volume-weighted schedules.
The combination of pessimistic backup and smoothing noise keeps learning stable despite the non-stationary order-book environment.
Normalised state features and per-step volume caps allow the same architecture to generalise across the ten tested stocks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Changing the reward function to penalise different risk measures could shift the variance-cost trade-off without altering the core architecture.
Applying the same twin-target and smoothing components to buy-side execution or to other impact models would test whether the gains are specific to sell programs.
Running the method on out-of-sample periods after the training window would show whether the shortfall reduction persists under new market conditions.

Load-bearing premise

The simulated trading environment based on the Almgren-Chriss impact model and historical LOB data sufficiently represents the dynamics of actual market execution for the tested stocks.
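To make the premise concrete, the sketch below computes implementation shortfall percentage for a sell schedule under a linear Almgren-Chriss-style impact applied to a given mid-price path. How the paper actually couples the impact model to LOB prices is not described here, so the coupling, the function name `implementation_shortfall_pct`, and the coefficients `gamma` (permanent) and `eta` (temporary) are assumptions.

```python
import numpy as np

def implementation_shortfall_pct(mid, shares, gamma, eta, tau=1.0):
    """Shortfall of a sell program versus the arrival price, in percent.

    mid    : mid prices per step (e.g. taken from LOB snapshots)
    shares : shares sold at each step
    gamma  : assumed linear permanent-impact coefficient
    eta    : assumed linear temporary-impact coefficient
    tau    : step length
    """
    mid = np.asarray(mid, dtype=float)
    shares = np.asarray(shares, dtype=float)
    # Permanent impact accumulates with everything sold before this step.
    sold_before = np.concatenate(([0.0], np.cumsum(shares)[:-1]))
    # Temporary impact depends on the current trading rate only.
    exec_price = mid - gamma * sold_before - eta * shares / tau
    arrival_value = shares.sum() * mid[0]
    proceeds = float(np.sum(shares * exec_price))
    return 100.0 * (arrival_value - proceeds) / arrival_value

# Flat price path and an even two-step schedule (hypothetical numbers).
is_pct = implementation_shortfall_pct([100.0, 100.0], [1000.0, 1000.0],
                                      gamma=1e-5, eta=1e-4)
```

Because every shortfall figure in the comparison flows through coefficients like `gamma` and `eta`, mis-specifying them shifts all methods' costs, which is why the fidelity of the simulator is load-bearing.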

What would settle it

Deploying the trained policy on live trading data for the same ten stocks and measuring no reduction in mean implementation shortfall would contradict the reported performance advantage.

Figures

Figures reproduced from arXiv: 2606.08379 by Alfonso Dufour, Atta Badii, Ilia Zaznov, Julian Kunkel.

**Figure 1.** Taxonomy of optimal execution research. VWAP (Volume-Weighted Average Price) aligns trades with volume: $x_t^{\mathrm{VWAP}} = Q \cdot v_t / \sum_{k=0}^{N-1} v_k$, where $v_t$ is expected or observed volume. No-dynamic-arbitrage constraints [19] require the permanent impact to be linear in aggregate order flow, ruling out price manipulation and ensuring market integrity. Extensions to the AC model include transient impact and resilience…
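The VWAP allocation in the caption, together with the TWAP baseline it is compared against, can be sketched in a few lines. The function names and the example volume profile are hypothetical; only the VWAP formula itself comes from the caption.

```python
import numpy as np

def twap_schedule(total_shares, n_steps):
    """Equal slice per step: x_t = Q / N."""
    return np.full(n_steps, total_shares / n_steps)

def vwap_schedule(total_shares, volumes):
    """Volume-proportional slice: x_t = Q * v_t / sum_k v_k."""
    v = np.asarray(volumes, dtype=float)
    return total_shares * v / v.sum()

twap = twap_schedule(1000.0, 4)
vwap = vwap_schedule(1000.0, [1.0, 3.0, 1.0])
```

Both schedules sell the full quantity Q; they differ only in how it is spread across steps.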

Abstract

This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses Ornstein-Uhlenbeck (OU) noise with a hybrid schedule: deterministic episode-wise decay, variance-guided adjustment based on recent reward dispersion, and a Soft Actor-Critic (SAC)-style temperature that is learned and mapped to the noise scale. The environment integrates Almgren-Chriss (AC) trade impact with Limit Order Book (LOB) prices and volumes, normalised state features, per-step volume participation caps, and a utility-based reward. The trade execution algorithm is applied to LOB data for ten U.S. stocks. Performance is assessed against reinforcement-learning baseline algorithms, including Proximal Policy Optimisation (PPO), Soft Actor-Critic (SAC), and Advantage Actor-Critic (A2C), as well as alternative trade execution algorithms, including Time-Weighted Average Price (TWAP), Volume-Weighted Average Price (VWAP), and AC. The proposed model consistently reduces mean implementation shortfall percentage with competitive variance, outperforming classical baselines and standard reinforcement-learning benchmark models.
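As a rough illustration of the exploration scheme, the sketch below implements a discretised Ornstein-Uhlenbeck process with only the deterministic episode-wise decay. The variance-guided adjustment and the learned temperature from the abstract are deliberately left out, and the class name `OUNoise`, the helper `decayed_scale`, and all parameter values are assumptions.

```python
import numpy as np

class OUNoise:
    """Discretised OU process: x <- x - theta*x*dt + sigma*sqrt(dt)*N(0, 1)."""

    def __init__(self, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = 0.0

    def reset(self):
        # Restart the process at its mean at the beginning of an episode.
        self.x = 0.0

    def sample(self, scale=1.0):
        drift = -self.theta * self.x * self.dt
        shock = self.sigma * np.sqrt(self.dt) * self.rng.normal()
        self.x += drift + shock
        return scale * self.x

def decayed_scale(episode, start=1.0, end=0.05, decay=0.99):
    """Deterministic episode-wise decay of the noise scale, floored at `end`."""
    return max(end, start * decay ** episode)

noise = OUNoise()
perturbation = noise.sample(scale=decayed_scale(episode=10))
```

The mean-reverting drift makes consecutive perturbations correlated, which is the usual reason for preferring OU noise over independent Gaussian noise in sequential control.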

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TT-DAC-PS is a competent incremental RL tweak for trade execution whose reported edge sits on an unvalidated Almgren-Chriss plus LOB simulator.

The paper's core contribution is a deterministic actor-critic that stacks twin EMA critics with min backup, TD3-style target policy smoothing, delayed actor updates, conservative Q regularization, and a hybrid OU noise schedule that mixes deterministic decay, variance guidance, and a learned SAC-style temperature. They run it on historical LOB snapshots for ten U.S. stocks inside an Almgren-Chriss impact environment and claim lower mean implementation shortfall than PPO, SAC, A2C, TWAP, VWAP, and the plain AC baseline.

The architecture is a legitimate synthesis of existing pieces rather than a new framework, and the choice to use real LOB data instead of pure synthetic paths is a clear positive. The reward and state normalization are described plainly enough that someone could re-implement the setup.

The main weakness is the environment. The abstract and stress-test note give no detail on how the AC parameters were fitted, whether they were stock-specific or fixed, or whether any realized slippage check was done against actual market data. Without that, the performance gap could be an artifact of the simulator rather than the algorithm. There is also no mention of run counts, statistical tests, or sensitivity to the participation-rate caps.

This paper is for people already working on RL for market microstructure who want another data point on actor-critic variants. A reader looking for a new theoretical angle or a rigorously validated execution method will not find it here.

It is coherent on its own terms and reports a concrete empirical comparison, so it deserves a serious referee even if the simulation validation will need substantial work.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces TT-DAC-PS, a deterministic actor-critic method for optimal trade execution that augments TD3 with twin exponential-moving-average critic targets, pessimistic min backup, target policy smoothing noise, delayed actor updates, and conservative Q regularization. Exploration employs Ornstein-Uhlenbeck noise under a hybrid schedule combining deterministic decay, variance-guided adjustment, and a learned SAC-style temperature. The environment combines the Almgren-Chriss impact model with historical limit-order-book snapshots; the method is evaluated on ten U.S. stocks and reports lower mean implementation shortfall (with competitive variance) relative to TWAP, VWAP, AC, PPO, SAC, and A2C.
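Two of the mechanisms in this summary, exponential-moving-average targets and delayed actor updates, are simple enough to sketch. The helper names, the retention weight `rho`, and the delay of 2 are assumptions borrowed from common TD3 practice rather than values reported in the manuscript.

```python
import numpy as np

def ema_update(target_params, online_params, rho=0.995):
    """Move each target parameter slowly toward its online counterpart."""
    return [rho * t + (1.0 - rho) * o
            for t, o in zip(target_params, online_params)]

def should_update_actor(step, policy_delay=2):
    """Delayed updates: refresh the actor once every `policy_delay` critic steps."""
    return step % policy_delay == 0

target = ema_update([np.zeros(2)], [np.ones(2)], rho=0.9)
```

A larger `rho` makes the targets change more slowly, trading responsiveness for stability of the bootstrapped Q estimates.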

Significance. If the simulator faithfully reproduces real-market impact and liquidity dynamics, the architecture could supply a practical, overestimation-resistant RL baseline for continuous-control execution problems; the hybrid noise schedule and twin-target design are potentially reusable in other noisy-reward domains. The empirical contribution, however, is conditional on validation of the Almgren-Chriss + LOB environment against realized slippage.

major comments (3)

[Environment and Experimental Setup] The headline outperformance claim rests on the Almgren-Chriss + LOB simulator. The manuscript provides no information on how the temporary and permanent impact coefficients were fitted to the ten stocks, whether parameters were stock-specific or fixed, or whether any out-of-sample validation against actual execution slippage was performed. Without these details the reported gains relative to the classical and RL baselines cannot be distinguished from artifacts of the impact model.
[Results section] In the tables reporting mean implementation shortfall, the comparisons lack the number of independent training runs, standard errors or confidence intervals, and any statistical significance tests (e.g., paired t-tests or Wilcoxon rank-sum across seeds or stocks). This omission prevents assessment of whether the claimed consistent reductions are statistically reliable or could arise from training variance.
[§4, or wherever the reward and participation constraints are defined] The utility-based reward and per-step volume caps are central to the policy objective, yet no sensitivity analysis is reported with respect to the choice of utility function or cap values; small changes in these modeling choices could alter the ranking versus the baselines.
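The paired comparison requested in the second comment could look like the standard-library sketch below, which computes a paired t statistic over per-seed shortfalls. The function name and the input numbers are hypothetical placeholders, not results from the paper; converting the statistic to a p-value would need a t distribution (for example from a statistics package), which is omitted here.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    if len(a) != len(b):
        raise ValueError("samples must be matched one-to-one")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1

# Placeholder shortfalls for a baseline and a candidate on the same four seeds.
baseline = [1.0, 2.0, 3.0, 4.0]
candidate = [0.0, 0.0, 1.0, 1.0]
t_stat, dof = paired_t_statistic(baseline, candidate)
```

Pairing by seed (or by stock) removes variation shared by both methods, so the test responds to the per-pair difference rather than to the overall spread.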

minor comments (3)

[Method] Notation for the twin-target and pessimistic-min operators should be introduced with explicit equations rather than prose descriptions only.
[Abstract and Results] The abstract states "consistently reduces" but the results tables do not indicate whether this holds for every stock or only on average; a per-stock breakdown or win-rate statistic would clarify the claim.
[Related Work] Missing references to recent RL-for-execution surveys or to prior work that already combines AC impact with LOB snapshots should be added for context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. Below we respond point-by-point to the major comments, indicating planned revisions where appropriate.

Referee: [Environment and Experimental Setup] the headline outperformance claim rests on the Almgren-Chriss + LOB simulator. The manuscript provides no information on how the temporary and permanent impact coefficients were fitted to the ten stocks, whether parameters were stock-specific or fixed, or whether any out-of-sample validation against actual execution slippage was performed. Without these details the reported gains relative to the classical and RL baselines cannot be distinguished from artifacts of the impact model.

Authors: We agree that the current manuscript lacks these details. In the revision we will add a subsection describing the stock-specific fitting procedure for the impact coefficients using historical LOB and trade data. We will also explicitly state that comprehensive out-of-sample validation against realized execution slippage was not performed (the study is simulator-based) and discuss this limitation. This provides the requested transparency without altering the core claims.
revision: partial

Referee: [Results section] the comparisons lack the number of independent training runs, standard errors or confidence intervals, and any statistical significance tests (e.g., paired t-tests or Wilcoxon rank-sum across seeds or stocks). This omission prevents assessment of whether the claimed consistent reductions are statistically reliable or could arise from training variance.

Authors: We accept this criticism. The revised manuscript will report the number of independent runs (10 random seeds), include standard errors and confidence intervals in all tables, and add paired t-tests (or Wilcoxon rank-sum where appropriate) across seeds and stocks to establish statistical reliability of the reported improvements.
revision: yes

Referee: [§4] the utility-based reward and per-step volume caps are central to the policy objective, yet no sensitivity analysis is reported with respect to the choice of utility function or cap values; small changes in these modeling choices could alter the ranking versus the baselines.

Authors: We acknowledge the value of such analysis. The revision will include a new sensitivity study (main text or appendix) varying the utility parameters and participation caps, demonstrating that TT-DAC-PS retains its performance advantage under reasonable perturbations of these choices.
revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external baselines

The manuscript introduces TT-DAC-PS, an RL architecture, and reports its performance on LOB data for ten stocks inside an AC+LOB simulator. All load-bearing results are obtained by direct comparison to independent baselines (TWAP, VWAP, AC, PPO, SAC, A2C). No equations reduce a claimed prediction to a fitted input by construction, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled via self-citation. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits identification of all parameters; the main assumption is the fidelity of the market simulation model.

axioms (1)

domain assumption: The trading environment is modeled accurately by the Almgren-Chriss impact function combined with LOB data.
This is invoked in the environment description in the abstract.

Reference graph

Works this paper leans on

61 extracted references · 6 canonical work pages · 3 internal anchors

[1] Robert Almgren and Neil Chriss. Optimal execution of portfolio transactions. Journal of Risk, 3(2):5–39, 2001
[2] Robert F Almgren. Optimal execution with nonlinear impact functions and trading-enhanced risk. Applied Mathematical Finance, 10(1):1–18, 2003
[3] Jim Gatheral and Alexander Schied. Optimal trade execution under geometric Brownian motion in the Almgren and Chriss framework. International Journal of Theoretical and Applied Finance, 14(03):353–368, 2011
[4] Weiguan Wang. Optimal Execution Under Nonlinear Transient Market Impact Model. PhD thesis, University College London, 2015
[5] Álvaro Cartea, Sebastian Jaimungal, and José Penalva. Algorithmic and High-Frequency Trading. Cambridge University Press, 2015
[6] Iacopo Mastromatteo, Bence Toth, and Jean-Philippe Bouchaud. Agent-based models for latent liquidity and concave price impact. Physical Review E, 89(4):042805, 2014
[7] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673–680. ACM, 2006
[8] Dieter Hendricks and Diane Wilcox. A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution. In 2014 IEEE Conference on computational intelligence for financial engineering & economics (CIFEr), pages 457–464. IEEE, 2014
[9] Yuchen Fang, Kan Ren, Weiqing Liu, Dong Zhou, Weinan Zhang, Jiang Bian, Yong Yu, and Tie-Yan Liu. Universal trading for order execution with oracle policy distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 107–115, 2021
[10] Álvaro Cartea, Sebastian Jaimungal, and Leandro Sánchez-Betancourt. Deep reinforcement learning for algorithmic trading. Available at SSRN 3812473, 2021
[11] Y Hafsi and E Vittori. Optimal execution with reinforcement learning. arXiv, 2024
[12] Tommaso Macrì and Fabrizio Lillo. Reinforcement learning for optimal execution when liquidity is time-varying. arXiv preprint arXiv:2402.12049, 2024
[13] Isaac Tonkin et al. Benchmarking deep reinforcement learning approaches to trade execution. Pacific-Basin Finance Journal, 94:102876, 2025
[14] Aurélien Alfonsi, Antje Fruth, and Alexander Schied. Optimal execution strategies in limit order books with general shape functions. Quantitative Finance, 10(2):143–157, 2009
[15] Ben Hambly, Renyuan Xu, and Huining Yang. Recent advances in reinforcement learning in finance. Mathematical Finance, 2021
[16] Matteo Micheli and Antoine Monod. Deep reinforcement learning for online optimal execution strategies. arXiv preprint arXiv:2410.13493, 2024
[17] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015. ICLR 2016
[18] Dimitris Bertsimas and Andrew W. Lo. Optimal control of execution costs. Journal of Financial Markets, 1(1):1–50, 1998
[19] Jim Gatheral. No-dynamic-arbitrage and market impact. Quantitative Finance, 10(7):749–759, 2010
[20] Jim Gatheral, Alexander Schied, and Alla Slynko. Transient linear price impact and optimal execution. Mathematical Finance, 25(3):557–592, 2015. Often cited by early preprint year 2013
[21] Alexander Schied. Robust strategies for optimal order execution in the Almgren–Chriss framework. Applied Mathematical Finance, 20(3):264–286, 2013
[22] Nicolae Gârleanu and Lasse Heje Pedersen. Dynamic trading with predictable returns and transaction costs. The Journal of Finance, 68(6):2309–2340, 2013
[23] Martin D. Gould, Mason A. Porter, Stacy Williams, Mark McDonald, Daniel J. Fenn, and Sam D. Howison. Limit order books. Quantitative Finance, 13(11):1709–1742, 2013
[24] Jean-Philippe Bouchaud, Marc Mézard, and Marc Potters. Statistical properties of stock order books: empirical results and models. Quantitative Finance, 2(4):251–256, 2002
[25] Rama Cont. Statistical modeling of high-frequency financial data. Annual Review of Financial Economics, 3(1):291–310, 2011
[26] Jean-Philippe Bouchaud, Yuval Gefen, Marc Potters, and Matthieu Wyart. Fluctuations and response in financial markets: The subtle nature of “random” price changes. Quantitative Finance, 4(2):176–190, 2004
[27] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(1):1550005, 2015
[28] Stephen Hardiman, Nicolas Bercot, and Jean-Philippe Bouchaud. Critical reflexivity in financial markets: a Hawkes process analysis. The European Physical Journal B, 86(10):442, 2013
[29] Bence Tóth, Yves Lemperiere, Cyril Deremble, Joachim De Lataillade, Julien Kockelkoren, and J-P Bouchaud. Anomalous price impact and the critical nature of liquidity in financial markets. Physical Review X, 1(2):021006, 2011
[30] Natalia Bershova and Dmitry Rakhlin. The non-linear market impact of large trades: Evidence from limit order books. The Journal of Trading, 8(3):1–12, 2013
[31] Robert Almgren, Chee Thum, Emmanuel Hauptmann, and Hong Li. Direct estimation of equity market impact. Risk, 18(7):58–62, 2005
[32] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018
[33] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, 2018
[34] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
[35] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014
[36] Bohan Ning, Xiaoteng Wang, Andrew Lim, and Jie Ye. Double deep q-learning for optimal trade execution. arXiv preprint arXiv:1812.06600, 2018
[37] Siyu Lin and Peter A Beling. An end-to-end optimal trade execution framework based on proximal policy optimization. In Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pages 4548–4554, 2021
[38] Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid. Deep reinforcement learning for automated stock trading: An ensemble strategy. In Proceedings of the first ACM international conference on AI in finance, pages 1–8, 2020
[39] Siyu Lin and Peter A Beling. A deep reinforcement learning framework for optimal trade execution. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 223–240. Springer, 2020
[40] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In International Conference on Learning Representations, 2018
[41] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations (ICLR), 2018
[42] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018
[43] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML Workshop on Principled Approaches to Deep Learning, 2017
[44] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016
[45] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling and the exploration-exploitation tradeoff. Foundations and Trends in Machine Learning, 11(1):1–96, 2018
[46] Jonathan Donier, Julius Bonart, Iacopo Mastromatteo, and Jean-Philippe Bouchaud. A fully consistent, minimal model for non-linear market impact. Quantitative Finance, 15(7):1109–1121, 2015
[47] Michael Schneider and Fabrizio Lillo. Cross-impact and no-dynamic-arbitrage. Quantitative Finance, 19(1):137–154, 2019
[48] Michael Benzaquen, Iacopo Mastromatteo, Zoltan Eisler, and Jean-Philippe Bouchaud. Dissecting cross impact on stock markets: An empirical analysis. Journal of Statistical Mechanics: Theory and Experiment, 2017(2):023406, 2017
[49] Iacopo Mastromatteo, Michael Benzaquen, Zoltan Eisler, and Jean-Philippe Bouchaud. Trading lightly: Cross-impact and optimal portfolio execution. 2017
[50] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. ABIDES: Towards high-fidelity market simulation for AI research. 2019
[51] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. ABIDES: Towards high-fidelity multi-agent market simulation. In Proceedings of the 2020 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS), 2020
[52] Ilia Zaznov, Julian Kunkel, Alfonso Dufour, and Atta Badii. Predicting stock price changes based on the limit order book: a survey. Mathematics, 10(8):1234, 2022
[53] Ilia Zaznov, Julian Martin Kunkel, Atta Badii, and Alfonso Dufour. The intraday dynamics predictor: a trioflow fusion of convolutional layers and gated recurrent units for high-frequency price movement forecasting. Applied Sciences, 14(7):2984, 2024
[54] Justin Sirignano and Rama Cont. Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance, 19(9):1449–1459, 2019
[55] Zihao Zhang, Stefan Zohren, and Stephen Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Access, 7:167692–167705, 2019
[56] Jiwon Jung and Kiseop Lee. Attention based reading, highlighting, and forecasting of the limit order book. 2024
[57] Matthias Schnaubelt, Jonas Löhner, Bálint Horváth, et al. Optimal execution with price-volume coupling. SSRN 3534315, 2020
[58] Álvaro Cartea, Sebastian Jaimungal, and Leandro Sánchez-Betancourt. Latency and liquidity risk. International Journal of Theoretical and Applied Finance, 24(06n07):2150035, 2021
[59] André F. Perold. The implementation shortfall: Paper versus reality. The Journal of Portfolio Management, 14(3):4–9, 1988
[60] R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000
[61] Ilia Zaznov, Atta Badii, Julian Kunkel, and Alfonso Dufour. Adamz: an enhanced optimisation method for neural network training. Neural Computing and Applications, pages 1–28, 2025

[1] [1]

Optimal execution of portfolio transactions.Journal of Risk, 3(2):5–39, 2001

Robert Almgren and Neil Chriss. Optimal execution of portfolio transactions.Journal of Risk, 3(2):5–39, 2001

2001

[2] Robert F. Almgren. Optimal execution with nonlinear impact functions and trading-enhanced risk. Applied Mathematical Finance, 10(1):1–18, 2003.

[3] Jim Gatheral and Alexander Schied. Optimal trade execution under geometric Brownian motion in the Almgren and Chriss framework. International Journal of Theoretical and Applied Finance, 14(03):353–368, 2011.

[4] Weiguan Wang. Optimal Execution Under Nonlinear Transient Market Impact Model. PhD thesis, University College London, 2015.

[5] Álvaro Cartea, Sebastian Jaimungal, and José Penalva. Algorithmic and High-Frequency Trading. Cambridge University Press, 2015.

[6] Iacopo Mastromatteo, Bence Toth, and Jean-Philippe Bouchaud. Agent-based models for latent liquidity and concave price impact. Physical Review E, 89(4):042805, 2014.

[7] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673–680. ACM, 2006.

[8] Dieter Hendricks and Diane Wilcox. A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution. In 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), pages 457–464. IEEE, 2014.

[9] Yuchen Fang, Kan Ren, Weiqing Liu, Dong Zhou, Weinan Zhang, Jiang Bian, Yong Yu, and Tie-Yan Liu. Universal trading for order execution with oracle policy distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 107–115, 2021.

[10] Álvaro Cartea, Sebastian Jaimungal, and Leandro Sánchez-Betancourt. Deep reinforcement learning for algorithmic trading. Available at SSRN 3812473, 2021.

[11] Y. Hafsi and E. Vittori. Optimal execution with reinforcement learning. arXiv, 2024.

[12] Tommaso Macrì and Fabrizio Lillo. Reinforcement learning for optimal execution when liquidity is time-varying. arXiv preprint arXiv:2402.12049, 2024.

[13] Isaac Tonkin et al. Benchmarking deep reinforcement learning approaches to trade execution. Pacific-Basin Finance Journal, 94:102876, 2025.

[14] Aurélien Alfonsi, Antje Fruth, and Alexander Schied. Optimal execution strategies in limit order books with general shape functions. Quantitative Finance, 10(2):143–157, 2009.

[15] Ben Hambly, Renyuan Xu, and Huining Yang. Recent advances in reinforcement learning in finance. Mathematical Finance, 2021.

[16] Matteo Micheli and Antoine Monod. Deep reinforcement learning for online optimal execution strategies. arXiv preprint arXiv:2410.13493, 2024.

[17] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015. ICLR 2016.

[18] Dimitris Bertsimas and Andrew W. Lo. Optimal control of execution costs. Journal of Financial Markets, 1(1):1–50, 1998.

[19] Jim Gatheral. No-dynamic-arbitrage and market impact. Quantitative Finance, 10(7):749–759, 2010.

[20] Jim Gatheral, Alexander Schied, and Alla Slynko. Transient linear price impact and optimal execution. Mathematical Finance, 25(3):557–592, 2015. Often cited by its early preprint year, 2013.

[21] Alexander Schied. Robust strategies for optimal order execution in the Almgren–Chriss framework. Applied Mathematical Finance, 20(3):264–286, 2013.

[22] Nicolae Gârleanu and Lasse Heje Pedersen. Dynamic trading with predictable returns and transaction costs. The Journal of Finance, 68(6):2309–2340, 2013.

[23] Martin D. Gould, Mason A. Porter, Stacy Williams, Mark McDonald, Daniel J. Fenn, and Sam D. Howison. Limit order books. Quantitative Finance, 13(11):1709–1742, 2013.

[24] Jean-Philippe Bouchaud, Marc Mézard, and Marc Potters. Statistical properties of stock order books: empirical results and models. Quantitative Finance, 2(4):251–256, 2002.

[25] Rama Cont. Statistical modeling of high-frequency financial data. Annual Review of Financial Economics, 3(1):291–310, 2011.

[26] Jean-Philippe Bouchaud, Yuval Gefen, Marc Potters, and Matthieu Wyart. Fluctuations and response in financial markets: The subtle nature of “random” price changes. Quantitative Finance, 4(2):176–190, 2004.

[27] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(1):1550005, 2015.

[28] Stephen Hardiman, Nicolas Bercot, and Jean-Philippe Bouchaud. Critical reflexivity in financial markets: a Hawkes process analysis. The European Physical Journal B, 86(10):442, 2013.

[29] Bence Tóth, Yves Lemperiere, Cyril Deremble, Joachim De Lataillade, Julien Kockelkoren, and J.-P. Bouchaud. Anomalous price impact and the critical nature of liquidity in financial markets. Physical Review X, 1(2):021006, 2011.

[30] Natalia Bershova and Dmitry Rakhlin. The non-linear market impact of large trades: Evidence from limit order books. The Journal of Trading, 8(3):1–12, 2013.

[31] Robert Almgren, Chee Thum, Emmanuel Hauptmann, and Hong Li. Direct estimation of equity market impact. Risk, 18(7):58–62, 2005.

[32] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[33] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[34] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[35] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[36] Bohan Ning, Xiaoteng Wang, Andrew Lim, and Jie Ye. Double deep Q-learning for optimal trade execution. arXiv preprint arXiv:1812.06600, 2018.

[37] Siyu Lin and Peter A. Beling. An end-to-end optimal trade execution framework based on proximal policy optimization. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 4548–4554, 2021.

[38] Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid. Deep reinforcement learning for automated stock trading: An ensemble strategy. In Proceedings of the First ACM International Conference on AI in Finance, pages 1–8, 2020.

[39] Siyu Lin and Peter A. Beling. A deep reinforcement learning framework for optimal trade execution. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 223–240. Springer, 2020.

[40] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.

[41] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations (ICLR), 2018.

[42] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

[43] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML Workshop on Principled Approaches to Deep Learning, 2017.

[44] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.

[45] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling and the exploration-exploitation tradeoff. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

[46] Jonathan Donier, Julius Bonart, Iacopo Mastromatteo, and Jean-Philippe Bouchaud. A fully consistent, minimal model for non-linear market impact. Quantitative Finance, 15(7):1109–1121, 2015.

[47] Michael Schneider and Fabrizio Lillo. Cross-impact and no-dynamic-arbitrage. Quantitative Finance, 19(1):137–154, 2019.

[48] Michael Benzaquen, Iacopo Mastromatteo, Zoltan Eisler, and Jean-Philippe Bouchaud. Dissecting cross impact on stock markets: An empirical analysis. Journal of Statistical Mechanics: Theory and Experiment, 2017(2):023406, 2017.

[49] Iacopo Mastromatteo, Michael Benzaquen, Zoltan Eisler, and Jean-Philippe Bouchaud. Trading lightly: Cross-impact and optimal portfolio execution. 2017.

[50] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. ABIDES: Towards high-fidelity market simulation for AI research. 2019.

[51] David Byrd, Maria Hybinette, and Tucker Hybinette Balch. ABIDES: Towards high-fidelity multi-agent market simulation. In Proceedings of the 2020 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS), 2020.

[52] Ilia Zaznov, Julian Kunkel, Alfonso Dufour, and Atta Badii. Predicting stock price changes based on the limit order book: a survey. Mathematics, 10(8):1234, 2022.

[53] Ilia Zaznov, Julian Martin Kunkel, Atta Badii, and Alfonso Dufour. The intraday dynamics predictor: a trioflow fusion of convolutional layers and gated recurrent units for high-frequency price movement forecasting. Applied Sciences, 14(7):2984, 2024.

[54] Justin Sirignano and Rama Cont. Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance, 19(9):1449–1459, 2019.

[55] Zihao Zhang, Stefan Zohren, and Stephen Roberts. DeepLOB: Deep convolutional neural networks for limit order books. IEEE Access, 7:167692–167705, 2019.

[56] Jiwon Jung and Kiseop Lee. Attention based reading, highlighting, and forecasting of the limit order book. 2024.

[57] Matthias Schnaubelt, Jonas Löhner, Bálint Horváth, et al. Optimal execution with price-volume coupling. SSRN 3534315, 2020.

[58] Álvaro Cartea, Sebastian Jaimungal, and Leandro Sánchez-Betancourt. Latency and liquidity risk. International Journal of Theoretical and Applied Finance, 24(06n07):2150035, 2021.

[59] André F. Perold. The implementation shortfall: Paper versus reality. The Journal of Portfolio Management, 14(3):4–9, 1988.

[60] R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000.

[61] Ilia Zaznov, Atta Badii, Julian Kunkel, and Alfonso Dufour. AdamZ: an enhanced optimisation method for neural network training. Neural Computing and Applications, pages 1–28, 2025.