Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution
Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3
The pith
Access to intra-episode memory lets RL agents in a two-agent liquidation game sustain lower implementation shortfalls than the game-theoretic benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the two-agent Almgren-Chriss liquidation game, DDQN agents that condition on intra-episode history—especially recent mid-prices and own past actions—produce supra-competitive outcomes, defined as lower implementation shortfalls than the relevant game-theoretic benchmark, at substantially higher rates and with greater persistence than agents restricted to ex-ante complete schedules.
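To make the definition concrete, here is a minimal sketch of how implementation shortfall could be computed for two sellers sharing one price path, and how an episode might be flagged as supra-competitive. It assumes a noise-free, discrete-time Almgren-Chriss-style model with unit time steps, where linear permanent impact `gamma` and linear temporary impact `eta` are both driven by the combined traded volume. The function names, the rule that every agent must beat the benchmark, and the benchmark shortfall supplied by the caller are illustrative assumptions, not the paper's specification.

```python
from typing import Sequence, Tuple


def implementation_shortfalls(
    schedule_a: Sequence[float],
    schedule_b: Sequence[float],
    s0: float,
    gamma: float,
    eta: float,
) -> Tuple[float, float]:
    """Return (IS_a, IS_b) for two sellers trading on one shared price path.

    Assumed model (illustrative, noise-free, unit time steps):
      execution price in step k = mid_k - eta * (combined volume in step k)
      mid_{k+1}                 = mid_k - gamma * (combined volume in step k)
    """
    if len(schedule_a) != len(schedule_b):
        raise ValueError("both schedules must cover the same number of steps")

    mid = s0
    proceeds_a = 0.0
    proceeds_b = 0.0
    for n_a, n_b in zip(schedule_a, schedule_b):
        total = n_a + n_b
        exec_price = mid - eta * total  # temporary impact hits both sellers
        proceeds_a += n_a * exec_price
        proceeds_b += n_b * exec_price
        mid -= gamma * total            # permanent impact carries forward

    # Shortfall = inventory valued at the initial mid-price minus cash received.
    is_a = sum(schedule_a) * s0 - proceeds_a
    is_b = sum(schedule_b) * s0 - proceeds_b
    return is_a, is_b


def is_supra_competitive(shortfalls: Sequence[float], benchmark_is: float) -> bool:
    """Assumed rule: every agent's shortfall lies below the benchmark value."""
    return all(s < benchmark_is for s in shortfalls)
```

With two identical schedules `[5, 5]`, `s0=100.0`, `gamma=0.1`, and `eta=0.2`, each agent's shortfall in this toy model is 25, so the episode would count as supra-competitive only against a benchmark shortfall above 25.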
What carries the argument
The contrast between ex-ante schedule-learning agents and state-contingent DDQN policies that incorporate intra-episode feedback and memory within the Almgren-Chriss two-agent execution environment.
If this is right
- Supra-competitive behavior requires state-contingent interaction along the realized execution path rather than multi-agent learning or current-price observation alone.
- Ex-ante schedule commitment removes the conditions under which supra-competitive results emerge.
- Recent prices combined with the agent's own past actions form the most effective memory signals for sustaining outperformance.
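The memory signals named above can be pictured as extra entries in the agent's observation vector. The sketch below is a hypothetical helper, not the paper's code: the class name, the window length, the zero padding, and the feature order are all assumptions chosen for illustration. Switching both flags off gives an observation with no intra-episode memory.

```python
from collections import deque
from typing import Deque, Iterable, List


class HistoryObservation:
    """Builds a fixed-length observation from recent mid-prices and own actions.

    Hypothetical helper: `window`, the zero padding, and the feature order are
    illustrative choices, not taken from the paper.
    """

    def __init__(self, window: int, use_prices: bool = True, use_actions: bool = True):
        self.window = window
        self.use_prices = use_prices
        self.use_actions = use_actions
        self.prices: Deque[float] = deque(maxlen=window)
        self.actions: Deque[float] = deque(maxlen=window)

    def reset(self) -> None:
        """Clear memory at the start of an episode (memory is intra-episode only)."""
        self.prices.clear()
        self.actions.clear()

    def record(self, mid_price: float, own_action: float) -> None:
        """Store the latest mid-price and the agent's own action; old entries drop out."""
        self.prices.append(mid_price)
        self.actions.append(own_action)

    def _padded(self, values: Iterable[float]) -> List[float]:
        items = list(values)
        return [0.0] * (self.window - len(items)) + items

    def build(self, time_left: float, inventory_left: float) -> List[float]:
        """Return [time_left, inventory_left, price window..., action window...]."""
        obs = [time_left, inventory_left]
        if self.use_prices:
            obs += self._padded(self.prices)
        if self.use_actions:
            obs += self._padded(self.actions)
        return obs
```

Keeping the two flags separate makes it possible to compare price memory, action memory, and their combination without changing anything else about the agent.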
Where Pith is reading between the lines
- Market venues that limit real-time data feeds to trading algorithms might reduce the frequency of these memory-driven outcomes.
- Similar memory effects could appear in other sequential multi-agent games where agents share a common price path.
- Extending the setup to three or more agents would test whether the same memory channel continues to support supra-competitive execution.
Load-bearing premise
Differences in observed outcomes are caused by the presence or absence of memory and intra-episode feedback rather than by unexamined variations in training stability or hyperparameter choices.
What would settle it
A controlled retraining experiment in which agents receive identical hyperparameters and architectures but are denied access to intra-episode price history and past actions, with outcomes then compared against the original memory-enabled runs.
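One way to organise such a controlled comparison is to hold a single training configuration fixed and vary only the observation flags. The following is a sketch under stated assumptions: the class names are hypothetical and every numeric value is a placeholder, since the paper's actual hyperparameters are not given here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrainConfig:
    # Placeholder values for illustration only; not the paper's settings.
    hidden_units: int = 64
    learning_rate: float = 1e-3
    replay_buffer_size: int = 10_000
    training_episodes: int = 5_000


@dataclass(frozen=True)
class ObservationSpec:
    price_history: bool      # agent sees recent mid-prices
    own_past_actions: bool   # agent sees its own earlier trades


def ablation_grid(shared: TrainConfig) -> Dict[str, Tuple[TrainConfig, ObservationSpec]]:
    """Pair every memory setting with the same training configuration."""
    grid = {}
    for prices, actions in product([False, True], repeat=2):
        name = f"prices={prices},actions={actions}"
        grid[name] = (shared, ObservationSpec(prices, actions))
    return grid
```

Because both dataclasses are frozen, a condition cannot silently drift away from the shared settings after the grid is built, which is the property the proposed experiment depends on.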
Original abstract
In this paper, we investigate whether deep reinforcement-learning agents interacting in a shared optimal-execution environment can sustain supra-competitive outcomes, in the sense of achieving lower implementation shortfalls than the relevant game-theoretical competitive benchmark. We study a two-agent Almgren-Chriss liquidation game and examine how learned behavior depends on intra-episode environment feedback, the ability to interpret the mid-price, and the agent's knowledge of the past. We first use ex-ante schedule-learning agents to remove intra-episode feedback and isolate what can arise when agents commit to complete liquidation trajectories before execution begins. We then allow agents to condition on the evolving state using a variety of DDQN architectures. We find that, when agents are given access to intra-episode history, especially recent prices and own past actions, supra-competitive outcomes become substantially more frequent and more persistent. These findings indicate that supra-competitive behavior in this execution game is driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction along the realized execution path.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies a two-agent Almgren-Chriss liquidation game and asks whether deep RL agents can produce supra-competitive outcomes (lower implementation shortfalls than the game-theoretic benchmark). It first examines ex-ante schedule-learning agents that commit to full trajectories without intra-episode feedback, then compares these to DDQN agents that receive varying degrees of intra-episode state information, including recent prices and own past actions. The central empirical claim is that access to intra-episode history substantially raises both the frequency and persistence of supra-competitive outcomes.
Significance. If the attribution to memory holds after controlling for training differences, the result would indicate that state-contingent feedback along the execution path, rather than multi-agent learning or price observation alone, drives outperformance of the competitive benchmark. This has potential implications for the design of execution algorithms and for understanding emergent non-competitive behavior in multi-agent financial RL settings. The paper's use of a clearly defined game-theoretic benchmark and forward simulation is a methodological strength.
major comments (2)
- [methods / experimental setup] Experimental setup (methods section describing DDQN variants): the manuscript does not state whether the DDQN agents with memory use identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria as the ex-ante schedule-learning baselines. If these differ, the reported increase in supra-competitive frequency could reflect optimization advantages or reduced non-stationarity rather than the memory mechanism itself.
- [results] Results on frequency and persistence: the claim that supra-competitive outcomes become 'substantially more frequent and more persistent' with intra-episode history requires quantitative support (number of independent runs, error bars or confidence intervals on the reported frequencies, and statistical tests comparing conditions). Without these, it is impossible to separate the memory effect from training variance.
minor comments (2)
- Notation: the distinction between 'ex-ante schedule-learning agents' and the various DDQN state-input configurations should be summarized in a single table for clarity.
- [results] The abstract states that the effect is driven by 'feedback, memory, and state-contingent interaction'; the results section should explicitly isolate which component (recent prices vs. own past actions) contributes most.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our work. We address each major comment below.
- Referee: [methods / experimental setup] Experimental setup (methods section describing DDQN variants): the manuscript does not state whether the DDQN agents with memory use identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria as the ex-ante schedule-learning baselines. If these differ, the reported increase in supra-competitive frequency could reflect optimization advantages or reduced non-stationarity rather than the memory mechanism itself.
Authors: We confirm that all agents (both the ex-ante schedule-learning baselines and the DDQN variants with varying intra-episode state information) were trained using identical network architectures, learning rates, replay-buffer sizes, training episode counts, and convergence criteria. This design choice was made explicitly to isolate the contribution of intra-episode memory and state feedback. We will add a dedicated paragraph in the revised Methods section stating these shared hyperparameters and training protocols.
Revision: yes
- Referee: [results] Results on frequency and persistence: the claim that supra-competitive outcomes become 'substantially more frequent and more persistent' with intra-episode history requires quantitative support (number of independent runs, error bars or confidence intervals on the reported frequencies, and statistical tests comparing conditions). Without these, it is impossible to separate the memory effect from training variance.
Authors: We agree that additional quantitative detail is necessary to support the frequency and persistence claims. Our experiments were performed across multiple independent training runs using different random seeds. In the revision we will report the exact number of runs, include error bars or confidence intervals on the supra-competitive frequencies, and add appropriate statistical comparisons (e.g., two-sample t-tests) between the memory and no-memory conditions.
Revision: yes
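The quantitative support discussed in this exchange can be sketched with standard-library Python. The example below treats each independent training run as a yes/no outcome (supra-competitive or not), reports a Wilson score interval for the frequency, and compares two conditions with a pooled two-proportion z-test. This is an illustrative choice, not the authors' stated procedure: the rebuttal mentions two-sample t-tests as one option, and the function names and run counts used here are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = successes / runs
    denom = 1.0 + z * z / runs
    centre = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


def two_proportion_z_test(s1: int, n1: int, s2: int, n2: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both conditions share one frequency."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        # Every run agrees in both conditions, so there is no observed difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical counts: 9 of 10 memory runs vs 1 of 10 no-memory runs.
low, high = wilson_interval(9, 10)
z_stat, p_val = two_proportion_z_test(9, 10, 1, 10)
```

The z-test relies on a normal approximation, so with only a handful of seeds per condition an exact test such as Fisher's may be the safer comparison.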
Circularity Check
No circularity: empirical simulation results against external benchmark
The paper reports outcomes from forward simulation of RL policies (DDQN variants with and without intra-episode state) against the externally defined Almgren-Chriss game-theoretic benchmark. No derivation step reduces a claimed result to a fitted parameter or self-citation by construction; the frequency of supra-competitive outcomes is measured directly from independent rollouts rather than being algebraically entailed by the training objective or prior author work. The central attribution to memory is therefore an empirical observation, not a definitional or fitted tautology.
Axiom & Free-Parameter Ledger
free parameters (1)
- DDQN architecture and training hyperparameters
axioms (1)
- [domain assumption] The Almgren-Chriss model is a valid representation of optimal execution price dynamics.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "supra-competitive outcomes become substantially more frequent and more persistent... driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, theorem embed_injective (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "history-aware DDQN... Transformer encoder with masked self-attention... recent prices and own past actions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.