pith. machine review for the scientific record.

arxiv: 2605.12653 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI · stat.ML

Recognition: no theorem link

Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

Arindam Banerjee, Eun Go, Rohan Deb

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords inference-time optimization · reinforcement learning · portfolio management · financial trading · price forecasting · model predictive control · DJ30 benchmark · stochastic policies

The pith

Reinforcement learning trading agents improve returns and risk metrics by optimizing their policy at each decision step against a price forecaster's predicted trajectory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard RL agents for portfolio management, trained as static policies, can be enhanced at inference time without retraining by using a forecaster to imagine multi-step price paths and then optimizing the current allocation choice. This works because one agent's trades have negligible effect on future market prices, so the forecaster can generate trajectories independently rather than through repeated action-conditioned simulations. The resulting plugin, FPILOT, builds an imagined return objective from the predicted prices and solves for a better action before executing the trade. Across five learning algorithms on the DJ30 benchmark, it delivers higher total returns and better Sharpe, Sortino, and Calmar ratios, with larger gains for stochastic policies. Performance scales with forecaster quality, suggesting the method will improve as financial prediction advances.

Core claim

FPILOT is a plugin that, at every decision point, takes the forecaster's multi-step price trajectory, constructs an allocation-based imagined return objective, and optimizes the pre-trained policy before executing one trade step. This adapts the agent to current forecasts without any model updates or full rollouts.

What carries the argument

The inference-time optimization step that maximizes an imagined return objective built from the forecaster's price trajectory and the agent's allocation choice.
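That step can be sketched as a small projected-gradient loop over the allocation. This is an illustrative reconstruction, not the paper's exact solver: the log-return objective, finite-difference gradient, step size, and iteration count are all assumptions.

```python
import numpy as np

def project_to_simplex(w):
    """Project weights onto the probability simplex (long-only, fully invested)."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(w)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(w + theta, 0)

def imagined_return(weights, price_path):
    """Cumulative log return of holding `weights` along a forecasted H-step price path."""
    rel = price_path[1:] / price_path[:-1]   # per-step gross asset returns
    port = rel @ weights                     # portfolio gross return each step
    return np.log(port).sum()

def fpilot_step(policy_action, price_path, lr=0.1, n_iters=200, eps=1e-5):
    """Refine the pre-trained policy's allocation against the imagined objective."""
    w = project_to_simplex(policy_action.copy())
    for _ in range(n_iters):
        # finite-difference gradient keeps the sketch dependency-free
        grad = np.array([
            (imagined_return(w + eps * e, price_path) - imagined_return(w, price_path)) / eps
            for e in np.eye(len(w))
        ])
        w = project_to_simplex(w + lr * grad)
    return w
```

With two assets and a forecast in which only the first asset appreciates, `fpilot_step(np.array([0.5, 0.5]), path)` shifts weight toward that asset while the simplex projection keeps the allocation long-only and fully invested.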

If this is right

  • Stochastic policies receive larger performance lifts than deterministic ones.
  • Gains increase steadily as the quality of the price forecasts improves.
  • The method applies to any pre-trained RL agent without requiring retraining.
  • Risk-adjusted metrics improve alongside raw total return on the DJ30 benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same inference-time planning step could be tested in other RL settings where the agent's action has little effect on the environment dynamics.
  • In fast-moving markets, the plugin might reduce how often full retraining is needed by letting the agent respond to fresh forecasts.
  • Pairing FPILOT with the best available forecasters would provide a direct way to measure how much better prediction accuracy translates into trading gains.

Load-bearing premise

A single agent's portfolio allocation does not meaningfully affect future market prices, so a price forecaster can generate useful multi-step trajectories without conditioning on the agent's actions.

What would settle it

Replace the forecaster with perfect future prices in a market simulator where the agent's own trades are large enough to move prices, then check whether FPILOT still outperforms the original policy.
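A minimal version of that experiment can be run with a toy linear-impact cost; the price paths, strategies, and impact coefficient below are hypothetical, not anything from the paper.

```python
import numpy as np

def realized_return(weights_seq, price_path, impact=0.0):
    """Portfolio growth along `price_path`, charging a linear cost impact·|trade| per rebalance.

    impact=0 recovers the paper's no-impact setting; impact>0 is a toy stand-in for the
    agent's own trades moving execution prices against it.
    """
    value, prev_w = 1.0, np.zeros(price_path.shape[1])
    for t, w in enumerate(weights_seq):
        gross = (price_path[t + 1] / price_path[t]) @ w   # gross return of the allocation
        cost = impact * np.abs(w - prev_w).sum()          # turnover-proportional impact cost
        value *= gross * (1 - cost)
        prev_w = w
    return value

# Alternating winners: a perfect-forecast chaser trades heavily, a 50/50 holder does not.
path = np.array([[100, 100], [105, 100], [105, 105], [110.25, 105], [110.25, 110.25]], float)
chase = [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 0.]), np.array([0., 1.])]
hold = [np.array([0.5, 0.5])] * 4
```

At `impact=0` the chaser's perfect foresight wins outright; at a large enough impact its turnover costs invert the ranking, which is exactly the failure mode the proposed test would expose.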

Figures

Figures reproduced from arXiv: 2605.12653 by Arindam Banerjee, Eun Go, Rohan Deb.

Figure 1
Figure 1. Overview of FinPILOT. A standard RL agent is pre-trained on historical market data with no access to any price forecaster. At inference time, an XGBoost forecaster provides an H-step price trajectory used as a surrogate world model; the pre-trained actor is adapted on the imagined objective Jt before executing an action in the real environment. view at source ↗
Figure 2
Figure 2. Portfolio value trajectories for SAC base. view at source ↗
Figure 5
Figure 5. Sharpe ratio as a function of forecast quality. view at source ↗
read the original abstract

Reinforcement learning agents for portfolio management are typically trained and deployed as static policies, with no mechanism for using price forecasts at inference time. We propose $\text{FPILOT}$ (**Fin**ancial **P**lugin **I**nference-time **L**earning for **O**ptimal **T**rading), a plugin inference-time optimization framework inspired by Model Predictive Control (MPC). Our key structural insight is that future prices mostly do not depend on one agent's portfolio allocation, so a suitable predictive model can produce a multi-step price trajectory without iterative action-conditioned rollouts as in typical reinforcement learning. At each decision step, we use the forecaster's predicted price trajectory to construct an allocation-based imagined return objective, and optimize the policy at inference-time before executing one step of the trade. Our framework is compatible with any pre-trained agent and adapts the policy to the forecaster's predictions without any retraining. Evaluated across five policy learning algorithms on the TradeMaster DJ30 benchmark, $\text{FPILOT}$ produces consistent improvements in total return and return-based risk-adjusted metrics (Sharpe, Sortino, Calmar), with stochastic policies benefiting more than deterministic ones. Further, using synthetic forecasts at calibrated quality levels, we show that gains consistently improve with forecaster quality, suggesting that our performance will improve based on advances in financial forecasting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FPILOT, an inference-time optimization framework for RL-based portfolio management agents. It uses a predictive model to generate multi-step price trajectories independent of the agent's actions, constructs an allocation-based imagined return objective, and optimizes the policy at each decision step before trading. Evaluated on the TradeMaster DJ30 benchmark with five policy learning algorithms, it reports consistent improvements in total return and risk-adjusted metrics (Sharpe, Sortino, Calmar), with larger benefits for stochastic policies, and shows gains increasing with forecaster quality using synthetic forecasts.

Significance. If the central assumption holds and the empirical gains are robust, FPILOT could serve as a practical plugin to enhance existing RL trading agents by incorporating forecasts at inference time without retraining. This approach bridges model-free RL with elements of model predictive control in a financial setting, and the scaling with forecast quality suggests benefits from advances in prediction models.

major comments (2)
  1. [Key structural insight] The assumption that future prices mostly do not depend on one agent's portfolio allocation is load-bearing for the framework's validity. The manuscript provides no quantification of market impact (e.g., via volume or impact functions) on DJ30 assets and no ablation injecting impact into the imagined trajectories. If impact is material, the optimized actions are based on incorrect counterfactuals, which could artifactually inflate the reported improvements over the base algorithms.
  2. [Evaluation on TradeMaster DJ30] The claims of consistent improvements lack accompanying details on statistical tests, confidence intervals, transaction costs, data splits, and exact baseline implementations. Without these, it is unclear whether the gains are statistically significant or practically meaningful beyond the no-impact assumption.
minor comments (2)
  1. Provide more details on the exact optimization procedure (e.g., solver, constraints on allocations) used at inference time to ensure reproducibility.
  2. Clarify how the framework handles stochastic vs deterministic policies in the optimization step.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: The assumption that future prices mostly do not depend on one agent's portfolio allocation is load-bearing for the framework's validity. The manuscript provides no quantification of market impact (e.g., via volume or impact functions) on DJ30 assets and no ablation injecting impact into the imagined trajectories. If impact is material, the optimized actions are based on incorrect counterfactuals, which could artifactually inflate the reported improvements over the base algorithms.

    Authors: We agree that the no-impact assumption is central to FPILOT, as it enables efficient non-iterative trajectory generation without action-conditioned rollouts. For the highly liquid DJ30 large-cap stocks, a single agent's allocations are typically a negligible fraction of daily volume, making the assumption reasonable in this benchmark setting. However, we acknowledge the manuscript lacks explicit quantification or robustness checks. In the revision we will add: (1) estimates of market impact for DJ30 assets using standard models (e.g., square-root impact) based on reported average volumes; (2) a discussion of scenarios where impact could matter; and (3) an ablation injecting synthetic linear impact into the imagined trajectories to measure sensitivity of the reported gains. These changes will strengthen the validity claims. revision: yes
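The square-root model the rebuttal proposes is easy to back-of-the-envelope. The order size, daily volume, volatility, and coefficient below are hypothetical placeholders, not figures from the paper or measured DJ30 data.

```python
import math

def sqrt_impact_bps(order_dollars, adv_dollars, daily_vol=0.015, Y=1.0):
    """Square-root market-impact estimate, ΔP/P ≈ Y·σ·sqrt(Q/V), in basis points.

    Q = order size, V = average daily volume (both in dollars), σ = daily volatility,
    Y = an order-one coefficient. All defaults here are illustrative assumptions.
    """
    return 1e4 * Y * daily_vol * math.sqrt(order_dollars / adv_dollars)

# A hypothetical $1M order in a large-cap name trading ~$500M/day:
bps = sqrt_impact_bps(1e6, 5e8)   # ≈ 6.7 bps
```

Single-digit basis points on a large-cap trade is consistent with the rebuttal's claim that impact is negligible at this scale, while making explicit where the assumption would break (larger orders, thinner volume).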

  2. Referee: The claims of consistent improvements lack accompanying details on statistical tests, confidence intervals, transaction costs, data splits, and exact baseline implementations. Without these, it is unclear whether the gains are statistically significant or practically meaningful beyond the no-impact assumption.

    Authors: We appreciate the call for greater statistical detail and reproducibility. The original manuscript follows the TradeMaster DJ30 protocol, with documented data splits (training through 2018, testing 2019-2020), inclusion of the benchmark's transaction cost model, and baselines implemented exactly as released in the TradeMaster codebase. We did not, however, report confidence intervals or formal significance tests. In the revision we will add bootstrap 95% confidence intervals for all metrics across seeds, paired t-tests (or Wilcoxon signed-rank where appropriate) for the improvements, and explicit pointers to the precise baseline hyperparameters and code versions used. This will clarify both statistical significance and practical relevance. revision: yes
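The promised statistical additions can be sketched on per-seed metric pairs. The Sharpe ratios below are synthetic illustrative numbers, and the exact sign test stands in for the Wilcoxon signed-rank test to keep the sketch dependency-light.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def paired_bootstrap_ci(base, plugin, n_boot=10_000, alpha=0.05):
    """Bootstrap confidence interval for the mean per-seed improvement (plugin minus base)."""
    diff = np.asarray(plugin) - np.asarray(base)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

def sign_test_p(base, plugin):
    """Two-sided exact sign test on per-seed improvements (a simple paired-test stand-in)."""
    diff = [p - b for b, p in zip(base, plugin)]
    n = sum(d != 0 for d in diff)
    k = sum(d > 0 for d in diff)
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-seed Sharpe ratios (illustrative, not the paper's results)
base   = [1.10, 0.95, 1.22, 1.05, 1.18, 1.01, 1.14, 0.99]
plugin = [1.25, 1.04, 1.30, 1.19, 1.27, 1.12, 1.22, 1.08]
lo, hi = paired_bootstrap_ci(base, plugin)
p_value = sign_test_p(base, plugin)
```

A confidence interval that excludes zero, plus a small exact-test p-value, is the shape of evidence the referee asks for; the revision would substitute the real per-seed metrics.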

Circularity Check

0 steps flagged

No significant circularity; the framework uses an external forecaster and standard optimization without self-referential reduction.

full rationale

The paper's derivation relies on an external predictive model to generate multi-step price trajectories (under the explicit assumption of negligible single-agent market impact) followed by a standard inference-time optimization of an allocation-based return objective. No equations or steps reduce by construction to fitted parameters renamed as predictions, nor do they depend on self-citations for uniqueness or ansatz smuggling. The benchmark evaluations and synthetic-forecaster ablations are independent of the method's internal construction. This is a standard MPC-style plugin with no load-bearing self-definition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that a single agent's allocation does not affect future prices, allowing non-iterative forecasting; the FPILOT optimization procedure itself is introduced without external verification beyond the reported benchmark gains.

axioms (1)
  • domain assumption Future prices mostly do not depend on one agent's portfolio allocation
    Stated as the key structural insight enabling non-iterative multi-step price trajectories.
invented entities (1)
  • FPILOT inference-time optimization framework · no independent evidence
    purpose: Plugin that constructs an allocation-based imagined return objective and optimizes the policy before each trade
    New method introduced to adapt pre-trained agents using external forecasts.

pith-pipeline@v0.9.0 · 5546 in / 1286 out tokens · 37030 ms · 2026-05-14T21:21:47.200662+00:00 · methodology

discussion (0)

