pith. machine review for the scientific record.

arxiv: 2605.01954 · v1 · submitted 2026-05-03 · 💻 cs.AI · cs.CL · cs.MA

Recognition: unknown

Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 17:04 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.MA

keywords hierarchical reinforcement learning · large language models · pair trading · prompt optimization · language-driven policies · financial decision making · credit assignment

The pith

Pair trading policies can be adapted at both abstraction and execution levels by updating LLM prompts with textual feedback instead of gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of credit assignment in hierarchical decision-making tasks, where high-level choices constrain low-level actions and feedback is delayed. It focuses on pair trading, which requires selecting asset pairs over long horizons and executing trades under uncertainty. By parameterizing both policy levels with large language models and optimizing them solely through prompt adjustments based on trajectory- and episode-level feedback, the approach separates semantic abstraction from execution. This separation reduces non-stationarity and allows targeted updates. Experiments on real-world market data demonstrate better performance than traditional RL and other LLM-based methods.
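For readers new to the domain, the execution problem being abstracted here can be made concrete with a classical (non-LLM) pair trading rule: trade the z-score of the price spread between two related assets. This is illustrative background only, not the paper's method; the function name and thresholds are invented for the sketch.

```python
# Classical mean-reversion pair trading on a price spread (illustrative,
# not the paper's LLM-driven method). Thresholds are arbitrary defaults.
from statistics import mean, stdev

def zscore_signal(prices_a, prices_b, entry=1.0, exit_=0.5):
    """Return positions per step: +1 long spread, -1 short spread, 0 flat."""
    spread = [a - b for a, b in zip(prices_a, prices_b)]
    mu, sigma = mean(spread), stdev(spread)
    positions, pos = [], 0
    for s in spread:
        z = (s - mu) / sigma
        if pos == 0 and z > entry:
            pos = -1          # spread rich: short A, long B
        elif pos == 0 and z < -entry:
            pos = +1          # spread cheap: long A, short B
        elif pos != 0 and abs(z) < exit_:
            pos = 0           # spread reverted to mean: close out
        positions.append(pos)
    return positions
```

With `entry=1.0` the rule is in the market only while the spread sits more than one standard deviation from its mean; the paper's contribution is to replace hand-set rules like this with LLM policies at both the selection and execution levels.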

Core claim

Pair trading is formulated as a hierarchical reinforcement learning problem where high-level policies select asset pairs using semantic reasoning and low-level policies execute trades. Both are implemented as LLMs optimized exclusively via prompt updates driven by textual feedback at trajectory and episode levels, explicitly separating abstraction from execution to mitigate non-stationarity and enable adaptation under delayed feedback.

What carries the argument

The language-driven hierarchical optimization framework that parameterizes high-level and low-level policies as LLMs and adapts them through prompt updates using textual feedback.
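The loop that sentence describes can be sketched schematically. Everything here is our reading of the abstract, not the paper's actual interface: `llm` is a stand-in callable, and the two-prompt structure simply encodes that each level is an LLM conditioned on its own mutable policy prompt, with learning meaning rewriting that prompt from level-specific textual feedback.

```python
# Schematic of language-driven hierarchical policy optimization as the
# review describes it. `llm(prompt, context)` is a placeholder callable;
# the real system's interface is not specified in the text shown here.

def run_episode(llm, selector_prompt, trader_prompt, market_state, horizon):
    """One episode: the Selector picks a pair, the Trader acts each step."""
    pair = llm(selector_prompt, market_state)                 # high level
    trajectory = []
    for t in range(horizon):
        action = llm(trader_prompt, (pair, market_state, t))  # low level
        trajectory.append(action)
    return pair, trajectory

def update_prompts(llm, selector_prompt, trader_prompt,
                   episode_feedback, trajectory_feedback):
    """Prompt-space 'gradient step': each prompt only ever sees feedback
    addressed to its own level, which is the claimed credit-assignment
    separation."""
    new_selector = llm("rewrite POLICY given: " + episode_feedback,
                       selector_prompt)
    new_trader = llm("rewrite POLICY given: " + trajectory_feedback,
                     trader_prompt)
    return new_selector, new_trader
```

Note that no gradients appear anywhere: both `run_episode` and `update_prompts` are plain text-in, text-out calls, which is what makes the scheme applicable to closed-weight models.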

If this is right

  • Explicit separation of abstraction selection from execution reduces non-stationarity across hierarchical levels.
  • Targeted adaptation is possible under delayed and ambiguous feedback without gradient-based fine-tuning.
  • Consistent performance improvements occur over traditional RL and LLM-based baselines on real-world market data for pair trading.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar prompt-based adaptation could extend to other hierarchical sequential tasks such as robotic planning where high-level goals and low-level controls need separate tuning.
  • Removing the need for gradient updates makes the method more accessible for domains where fine-tuning large models is computationally expensive.
  • Textual feedback might allow incorporating human expertise more directly into policy adaptation loops.

Load-bearing premise

Trajectory- and episode-level textual feedback is sufficient to adapt both high-level semantic abstractions and low-level execution policies effectively without any gradient-based fine-tuning.

What would settle it

Running the same experiments but replacing prompt updates with gradient fine-tuning of the LLMs and observing no performance difference or degradation would indicate the textual feedback mechanism is not the key driver.

Figures

Figures reproduced from arXiv: 2605.01954 by Guojun Xiong, Jimin Huang, Lingfei Qian, Polydoros Giannouris, Sophia Ananiadou, Xueqing Peng, Yuechen Jiang, Yuyan Wang.

Figure 1: Overview of the language-driven hierarchical pair trading framework. A high-level Selector LLM chooses asset …
Figure 2: Language-based policy optimization procedure.
Figure 3: Normalized equity curves and trade entry points for …
Figure 4: Normalized metric performance across optimiza…
read the original abstract

Many sequential decision-making problems exhibit hierarchical structure, where high-level semantic choices constrain downstream actions and feedback is delayed and ambiguous. Learning in such settings is challenging due to credit assignment: performance degradation may arise from flawed abstractions, suboptimal execution, or their interaction. We study this challenge through pair trading, a domain that naturally combines long-horizon semantic reasoning for asset pair selection with short-horizon execution under partial observability. We formulate pair trading as a hierarchical reinforcement learning problem and propose a language-driven optimization framework in which both high-level and low-level policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates. Our approach leverages pretrained LLMs as hierarchical policies and uses trajectory- and episode-level textual feedback to adapt abstractions and execution without gradient-based fine-tuning. By explicitly separating abstraction selection from execution, the framework reduces non-stationarity across hierarchical levels and enables targeted adaptation under delayed feedback. Experiments on real-world market data show consistent improvements over traditional and LLM-based baselines, demonstrating the effectiveness of language-driven hierarchical reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce a language-driven hierarchical reinforcement learning framework called Moira for pair trading. In this setup, large language models serve as both high-level policies for selecting trading pairs based on semantic reasoning and low-level policies for executing trades under partial observability. Policies are adapted exclusively through updates to prompts using textual feedback from trajectories and episodes, without any gradient-based fine-tuning. The separation of levels is argued to reduce non-stationarity, facilitating better credit assignment in the presence of delayed and ambiguous feedback. Experiments using real-world market data are said to demonstrate consistent performance improvements over traditional and LLM-based baselines.

Significance. Should the empirical results prove robust, this contribution could advance the application of LLMs to hierarchical decision-making problems in finance and beyond, particularly where gradient access is limited or undesirable. It highlights a prompt-based method for adapting hierarchical policies, potentially offering a more interpretable alternative to standard RL fine-tuning. The use of real market data adds practical relevance, though the absence of detailed validation for the credit assignment mechanism limits the immediate impact.

major comments (2)
  1. [Abstract] The statement that experiments 'show consistent improvements' lacks any quantitative details, such as specific performance metrics (e.g., Sharpe ratio, returns), baseline comparisons, number of assets or time periods tested, or statistical significance tests. Without these, it is impossible to evaluate whether the data supports the central claim of effectiveness.
  2. [Abstract] The assertion that the framework 'enables targeted adaptation under delayed feedback' by separating abstraction selection from execution relies on the unverified assumption that textual feedback can perform level-specific credit assignment. No mechanism is provided for how a textual summary distinguishes high-level semantic errors from low-level execution issues amid market noise and partial observability, which is critical to the hierarchical benefit claimed.
minor comments (1)
  1. The abstract is dense with technical terms; consider adding a brief overview of pair trading for broader accessibility.
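For concreteness, the Sharpe ratio the referee asks to see reported is conventionally computed as below. The daily annualization factor and zero risk-free rate are assumed conventions for this sketch, not choices taken from the paper.

```python
# Conventional (annualized) Sharpe ratio of a series of periodic returns.
# The 252-day annualization assumes daily returns; adjust for other bars.
from statistics import mean, stdev

def sharpe(returns, periods_per_year=252, risk_free=0.0):
    excess = [r - risk_free / periods_per_year for r in returns]
    return mean(excess) / stdev(excess) * periods_per_year ** 0.5
```

A strategy that merely oscillates around zero return scores near zero regardless of trade count, which is why the referee asks for this metric alongside raw returns.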

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications from the full paper and indicating where revisions will strengthen the presentation without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract] The statement that experiments 'show consistent improvements' lacks any quantitative details, such as specific performance metrics (e.g., Sharpe ratio, returns), baseline comparisons, number of assets or time periods tested, or statistical significance tests. Without these, it is impossible to evaluate whether the data supports the central claim of effectiveness.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately assess the strength of the empirical claims. The full manuscript's experimental section (Section 5) reports quantitative results on real market data, including comparisons against traditional RL baselines and other LLM-based methods, with metrics such as cumulative returns, Sharpe ratios, and win rates across multiple asset pairs and time periods. In the revised manuscript we will condense key quantitative highlights into the abstract (e.g., average return improvement and number of evaluated pairs/periods) while preserving the high-level summary style. revision: yes

  2. Referee: [Abstract] The assertion that the framework 'enables targeted adaptation under delayed feedback' by separating abstraction selection from execution relies on the unverified assumption that textual feedback can perform level-specific credit assignment. No mechanism is provided for how a textual summary distinguishes high-level semantic errors from low-level execution issues amid market noise and partial observability, which is critical to the hierarchical benefit claimed.

    Authors: The manuscript does provide a mechanism, described in Sections 3.2 and 4: high-level policies receive episode-level textual summaries that focus on pair-selection quality and semantic alignment with market regimes, while low-level policies receive trajectory-level feedback that isolates execution actions (entry/exit timing and sizing) within a fixed pair. Because the LLM reasons over these distinct textual signals, it can attribute performance degradation to either abstraction choice or execution without gradient-based credit assignment. We acknowledge that the abstract is too terse to convey this distinction and that explicit examples of feedback prompts would help readers verify the separation. We will therefore add a short illustrative example and a brief robustness discussion in the revised version. revision: partial
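The level separation the rebuttal describes can be illustrated with two feedback constructors. This is our reconstruction of the idea, not the paper's actual prompt templates; every field name and phrasing here is invented for illustration.

```python
# Illustrative level-separated textual feedback (invented field names;
# the paper's actual feedback prompts are not reproduced in this review).

def episode_feedback(pair, episode_return, regime):
    """Addressed to the Selector: only pair-choice quality and regime fit,
    no trade-level detail."""
    return (f"Pair {pair} returned {episode_return:+.1%} in a {regime} "
            "regime; judge whether the selection matched the regime.")

def trajectory_feedback(trades, pnl_per_trade):
    """Addressed to the Trader: only entry/exit quality within the fixed
    pair, no pair-selection commentary."""
    worst = min(range(len(trades)), key=lambda i: pnl_per_trade[i])
    return (f"{len(trades)} trades; worst entry/exit was trade {worst} "
            f"with PnL {pnl_per_trade[worst]:+.2f}; adjust timing rules.")
```

The point of the separation is structural: the Selector's feedback cannot mention execution mistakes and the Trader's cannot mention pair choice, so each prompt update can only respond to errors at its own level.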

Circularity Check

0 steps flagged

No significant circularity; derivation and evaluation are self-contained.

full rationale

The paper formulates pair trading as hierarchical RL with LLMs as policies updated exclusively via prompt-based textual feedback at trajectory and episode levels. It claims this separation reduces non-stationarity and enables adaptation without gradients. Evaluation relies on external real-world market data and comparisons to independent baselines. No equations, fitted parameters presented as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work are present in the provided text. The chain from problem statement to empirical results does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The ledger reflects the key assumptions in the abstract; full paper may introduce additional parameters or entities not visible here.

axioms (1)
  • domain assumption Pretrained LLMs can serve as effective policies for both high-level and low-level decisions in hierarchical RL when adapted via prompts.
    Central to the framework's claim of optimization without gradient-based fine-tuning.

pith-pipeline@v0.9.0 · 5504 in / 1209 out tokens · 64827 ms · 2026-05-09T17:04:49.218864+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 17 canonical work pages · 2 internal anchors

  1. [1]

    Sotiris Anagnostidis and Jannis Bulian. 2024. How susceptible are LLMs to influence in prompts? arXiv preprint arXiv:2408.11865 (2024)

  2. [2]

    Mahmut Bağcı and Pınar Kaya Soylu. 2025. The optimal threshold selection for high-frequency pairs trading via supervised machine learning algorithms. Computational Economics (2025), 1–29

  3. [3]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)

  4. [4]

    William K Bertram. 2010. Analytic solutions for optimal statistical arbitrage trading. Physica A: Statistical Mechanics and its Applications 389, 11 (2010), 2234–2243

  5. [5]

    Mario Carrasco Blázquez, Camilo Prado Román, et al. 2018. Pairs trading techniques: An empirical contrast. European Research on Management and Business Economics 24, 3 (2018), 160–167

  6. [6]

    Andrew Brim. 2020. Deep reinforcement learning pairs trading with a double deep Q-network. In 2020 10th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 0222–0227

  7. [7]

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. 2023. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning. PMLR, 287–318

  8. [8]

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. 2023. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217 (2023)

  9. [9]

    Peter Dayan and Geoffrey E Hinton. 1992. Feudal reinforcement learning. Advances in Neural Information Processing Systems 5 (1992)

  10. [10]

    Han Ding, Yinheng Li, Junhao Wang, and Hang Chen. 2024. Large language model agent in financial trading: A survey. arXiv preprint arXiv:2408.06361 (2024)

  11. [11]

    Binh Do and Robert Faff. 2012. Are pairs trading profits robust to trading costs? Journal of Financial Research 35, 2 (2012), 261–287

  12. [12]

    Robert J Elliott, John Van Der Hoek, and William P Malcolm. 2005. Pairs trading. Quantitative Finance 5, 3 (2005), 271–276

  13. [13]

    Saeid Fallahpour, Hasan Hakimian, Khalil Taheri, and Ehsan Ramezanifar. 2016. Pairs trading strategy optimization using the reinforcement learning method: a cointegration approach. Soft Computing 20, 12 (2016), 5051–5066

  14. [14]

    George Fatouros, Kostas Metaxas, John Soldatos, and Dimosthenis Kyriazis. 2025. Can Large Language Models beat Wall Street? Evaluating GPT-4's impact on financial decision-making with MarketSenseAI. Neural Computing and Applications 37, 30 (2025), 24893–24918

  15. [15]

    Evan Gatev, William N Goetzmann, and K Geert Rouwenhorst. 2006. Pairs trading: Performance of a relative-value arbitrage rule. The Review of Financial Studies 19, 3 (2006), 797–827

  16. [16]

    Minghong Geng, Shubham Pateria, Budhitama Subagdja, Lin Li, Xin Zhao, and Ah-Hwee Tan. 2025. L2M2: A hierarchical framework integrating large language model and multi-agent reinforcement learning. In International Joint Conference on Artificial Intelligence

  17. [17]

    Minjia Guo, Jianhe Liu, Ziping Luo, and Xiao Han. 2024. Deep reinforcement learning for pairs trading: Evidence from China black series futures. International Review of Economics & Finance 93 (2024), 981–993

  18. [18]

    Zihao Guo, Hanqing Jin, Jiaqi Kuang, Zhongmin Qian, and Jinghan Wang. 2025. Signature Decomposition Method Applying to Pair Trading. Journal of Futures Markets (2025)

  19. [19]

    Weiguang Han, Boyi Zhang, Qianqian Xie, Min Peng, Yanzhao Lai, and Jimin Huang. 2023. Select and trade: Towards unified pair trading with hierarchical reinforcement learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4123–4134

  20. [20]

    Yifan Hao, Xingyuan Pan, Hanning Zhang, Chenlu Ye, Rui Pan, and Tong Zhang. 2025. Understanding Overadaptation in Supervised Fine-Tuning: The Role of Ensemble Methods. arXiv preprint arXiv:2506.01901 (2025)

  21. [21]

    Polydoros Giannouris, Yuechen Jiang, Lingfei Qian, Yuyan Wang, Xueqing Peng, Jimin Huang, Guojun Xiong, and Sophia Ananiadou

  22. [22]

    Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, and Yu Cheng. 2025. Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning. arXiv preprint arXiv:2505.19761 (2025)

  23. [23]

    Peng Huang and Tianxiang Wang. 2016. On the profitability of optimal mean reversion trading strategies. arXiv preprint arXiv:1602.05858 (2016)

  24. [24]

    Nicolas Huck and Komivi Afawubo. 2015. Pairs trading and selection methods: is cointegration superior? Applied Economics 47, 6 (2015), 599–613

  25. [25]

    Sang-Ho Kim, Deog-Yeong Park, and Ki-Hoon Lee. 2022. Hybrid deep reinforcement learning for pairs trading. Applied Sciences 12, 3 (2022), 944

  26. [26]

    Taewook Kim and Ha Young Kim. 2019. Optimizing the Pairs-Trading Strategy Using Deep Reinforcement Learning with Trading and Stop-Loss Boundaries. Complexity 2019, 1 (2019), 3582516

  27. [27]

    Christopher Krauss. 2017. Statistical arbitrage pairs trading strategies: Review and outlook. Journal of Economic Surveys 31, 2 (2017), 513–545

  28. [28]

    Christopher Krauss, Xuan Anh Do, and Nicolas Huck. 2017. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research 259, 2 (2017), 689–702

  29. [29]

    Andrew Levy, George Konidaris, Robert Platt, and Kate Saenko. 2019. Learning multi-level hierarchies with hindsight. In Proceedings of International Conference on Learning Representations

  30. [30]

    Xiang Li, Zhenyu Li, Chen Shi, Yong Xu, Qing Du, Mingkui Tan, and Jun Huang. 2024. Alphafin: Benchmarking financial analysis with retrieval-augmented stock-chain framework. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 773–783

  32. [32]

    Alejandro Lopez-Lira and Yuehua Tang. 2023. Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv preprint arXiv:2304.07619 (2023)

  33. [33]

    Jing-You Lu, Hsu-Chao Lai, Wen-Yueh Shih, Yi-Feng Chen, Shen-Hang Huang, Hao-Han Chang, Jun-Zhe Wang, Jiun-Long Huang, and Tian-Shyr Dai. 2022. Structural break-aware pairs trading strategy using deep reinforcement learning. The Journal of Supercomputing 78, 3 (2022), 3843–3882

  34. [34]

    Dat Mai. 2024. StockGPT: A GenAI model for stock prediction and trading. arXiv preprint arXiv:2404.05101 (2024)

  35. [35]

    Ofir Nachum, Shixiang Shane Gu, Honglak Lee, and Sergey Levine. 2018. Data-efficient hierarchical reinforcement learning. Advances in Neural Information Processing Systems 31 (2018)

  36. [36]

    Shubham Pateria, Budhitama Subagdja, Ah-Hwee Tan, and Chai Quek. 2021. Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR) 54, 5 (2021), 1–35

  37. [37]

    Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Olivier Pietquin, and Laura Toni. 2023. A survey of temporal credit assignment in deep reinforcement learning. arXiv preprint arXiv:2312.01072 (2023)

  38. [38]

    Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, et al. 2025. When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents. arXiv preprint arXiv:2510.11695 (2025)

  39. [39]

    Molei Qin, Shuo Sun, Wentao Zhang, Haochong Xia, Xinrun Wang, and Bo An. 2024. EarnHFT: Efficient hierarchical reinforcement learning for high frequency trading. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 14669–14676

  41. [41]

    Hossein Rad, Rand Kwong Yew Low, and Robert Faff. 2016. The profitability of pairs trading strategies: distance, cointegration and copula methods. Quantitative Finance 16, 10 (2016), 1541–1558

  42. [42]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741

  43. [43]

    Francesco Rotondi and Federico Russo. 2024. Machine Learning for Pairs Trading: a Clustering-based Approach. Available at SSRN 5080998 (2024)

  44. [44]

    Utsav Singh and Vinay P Namboodiri. 2023. PEAR: Primitive enabled adaptive relabeling for boosting hierarchical reinforcement learning. arXiv preprint arXiv:2306.06394 (2023)

  45. [45]

    Yufei Sun. 2025. A survey of statistical arbitrage pairs trading strategies with non-machine learning methods, 2016–2023. Faculty of Economic Sciences, University of Warsaw Working Papers 2025-19 (2025)

  46. [46]

    Masood Tadi and Jiří Witzany. 2025. Copula-based trading of cointegrated cryptocurrency pairs. Financial Innovation 11, 1 (2025), 40

  47. [47]

    Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. 2017. Feudal networks for hierarchical reinforcement learning. In International Conference on Machine Learning. PMLR, 3540–3549

  48. [48]

    Ganapathy Vidyamurthy. 2004. Pairs Trading: Quantitative Methods and Analysis. John Wiley & Sons

  49. [49]

    Cheng Wang, Patrik Sandås, and Peter Beling. 2021. Improving pairs trading strategies via reinforcement learning. In 2021 International Conference on Applied Artificial Intelligence (ICAPAI). IEEE, 1–7

  50. [50]

    Saizhuo Wang, Hang Yuan, Lionel M Ni, and Jian Guo. 2024. QuantAgent: Seeking holy grail in trading by self-improving large language model. arXiv preprint arXiv:2402.03755 (2024)

  51. [51]

    Saizhuo Wang, Hang Yuan, Leon Zhou, Lionel Ni, Heung Yeung Shum, and Jian Guo. 2025. Alpha-GPT: Human-AI interactive alpha mining for quantitative investment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 196–206

  52. [52]

    Yijia Xiao, Edward Sun, Di Luo, and Wei Wang. 2024. TradingAgents: Multi-agents LLM financial trading framework. arXiv preprint arXiv:2412.20138 (2024)

  53. [53]

    Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E Smith, Xiao-Yang Liu, et al. 2025. FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading. arXiv preprint arXiv:2502.11433 (2025)

  54. [54]

    Fucui Xu and Shan Tan. 2020. Dynamic portfolio management based on pair trading and deep reinforcement learning. In Proceedings of the 2020 3rd International Conference on Computational Intelligence and Intelligent Systems. 50–55

  55. [55]

    Hongshen Yang and Avinash Malik. 2024. Reinforcement Learning Pair Trading: A Dynamic Scaling Approach. Journal of Risk and Financial Management 17, 12 (2024), 555

  56. [56]

    Hongyang Yang, Boyu Zhang, Neng Wang, Cheng Guo, Xiaoli Zhang, Likun Lin, Junlin Wang, Tianyu Zhou, Mao Guan, Runjia Zhang, et al. 2024. FinRobot: An open-source AI agent platform for financial applications using large language models. arXiv preprint arXiv:2405.14767 (2024)

  57. [57]

    Yi Yang, Yixuan Tang, and Kar Yan Tam. 2023. InvestLM: A large language model for investment using financial domain instruction tuning. arXiv preprint arXiv:2309.13064 (2023)

  58. [58]

    Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Jordan W Suchow, Denghui Zhang, and Khaldoun Khashanah. 2025. FinMem: A performance-enhanced LLM trading agent with layered memory and character design. IEEE Transactions on Big Data (2025)

  59. [59]

    Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan Suchow, Zhenyu Cui, Rong Liu, et al. 2024. FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems 37 (2024), 137010–137045

  60. [60]

    Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. 2024. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4314–4325
