Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading
Pith reviewed 2026-05-09 17:04 UTC · model grok-4.3
The pith
Pair trading policies can be adapted at both abstraction and execution levels by updating LLM prompts with textual feedback instead of gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Pair trading is formulated as a hierarchical reinforcement learning problem where high-level policies select asset pairs using semantic reasoning and low-level policies execute trades. Both are implemented as LLMs optimized exclusively via prompt updates driven by textual feedback at trajectory and episode levels, explicitly separating abstraction from execution to mitigate non-stationarity and enable adaptation under delayed feedback.
What carries the argument
The language-driven hierarchical optimization framework that parameterizes high-level and low-level policies as LLMs and adapts them through prompt updates using textual feedback.
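The control flow this framework implies can be sketched as a small loop: a selector LLM chooses a pair, a trader LLM acts on it, and both policies are revised purely by rewriting their prompts with textual feedback. The function names, prompt layout, and the `call_llm` stub below are assumptions for exposition, not the paper's implementation; a real system would back `call_llm` with a chat-completion API.

```python
def call_llm(prompt: str) -> str:
    # Canned stand-in so the control flow runs without a model.
    if "Candidates:" in prompt:
        # Selector call: pick the first listed candidate pair.
        return prompt.rsplit("Candidates: ", 1)[1].split(",")[0].strip()
    if prompt.startswith("Revise"):
        # Prompt-update call: echo the old POLICY block;
        # a real LLM would return a rewritten one.
        return prompt.split("POLICY:\n", 1)[1].split("\nFEEDBACK:", 1)[0]
    return "HOLD"  # trader call: default action

def run_episode(selector_prompt: str, trader_prompt: str, market: dict):
    # High level: the selector LLM picks an asset pair from semantic context.
    pair = call_llm(selector_prompt + "\nCandidates: " + ", ".join(market["pairs"]))
    # Low level: the trader LLM emits one execution action per observed state.
    trajectory = [call_llm(trader_prompt + f"\nPair: {pair}\nState: {s}")
                  for s in market["states"][pair]]
    return pair, trajectory

def update_prompts(selector_prompt: str, trader_prompt: str,
                   episode_feedback: str, trajectory_feedback: str):
    # Adaptation is pure prompt rewriting: each policy's prompt is revised
    # against its own level of textual feedback; no gradients are involved.
    revised_selector = call_llm(
        "Revise the pair-selection POLICY given this episode-level critique.\n"
        f"POLICY:\n{selector_prompt}\nFEEDBACK:\n{episode_feedback}")
    revised_trader = call_llm(
        "Revise the execution POLICY given this trajectory-level critique.\n"
        f"POLICY:\n{trader_prompt}\nFEEDBACK:\n{trajectory_feedback}")
    return revised_selector, revised_trader
```

The key design point is that the model weights never change: the "parameters" being optimized are the two prompt texts, each updated against feedback from its own hierarchical level.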
If this is right
- Explicit separation of abstraction selection from execution reduces non-stationarity across hierarchical levels.
- Targeted adaptation is possible under delayed and ambiguous feedback without gradient-based fine-tuning.
- Consistent performance improvements occur over traditional RL and LLM-based baselines on real-world pair-trading market data.
Where Pith is reading between the lines
- Similar prompt-based adaptation could extend to other hierarchical sequential tasks such as robotic planning where high-level goals and low-level controls need separate tuning.
- Removing the need for gradient updates makes the method more accessible for domains where fine-tuning large models is computationally expensive.
- Textual feedback might allow incorporating human expertise more directly into policy adaptation loops.
Load-bearing premise
Trajectory- and episode-level textual feedback is sufficient to adapt both high-level semantic abstractions and low-level execution policies effectively without any gradient-based fine-tuning.
What would settle it
Running the same experiments but replacing prompt updates with gradient fine-tuning of the LLMs and observing no performance difference or degradation would indicate the textual feedback mechanism is not the key driver.
Original abstract
Many sequential decision-making problems exhibit hierarchical structure, where high-level semantic choices constrain downstream actions and feedback is delayed and ambiguous. Learning in such settings is challenging due to credit assignment: performance degradation may arise from flawed abstractions, suboptimal execution, or their interaction. We study this challenge through pair trading, a domain that naturally combines long-horizon semantic reasoning for asset pair selection with short-horizon execution under partial observability. We formulate pair trading as a hierarchical reinforcement learning problem and propose a language-driven optimization framework in which both high-level and low-level policies are parameterized by large language models (LLMs) and optimized exclusively through prompt updates. Our approach leverages pretrained LLMs as hierarchical policies and uses trajectory- and episode-level textual feedback to adapt abstractions and execution without gradient-based fine-tuning. By explicitly separating abstraction selection from execution, the framework reduces non-stationarity across hierarchical levels and enables targeted adaptation under delayed feedback. Experiments on real-world market data show consistent improvements over traditional and LLM-based baselines, demonstrating the effectiveness of language-driven hierarchical reinforcement learning.
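For orientation on the domain (this example is not from the paper): the classic pair-trading signal, which the low-level execution policy must approximate or improve on, trades the z-score of the spread between two co-moving assets. The thresholds and the unit hedge ratio below are illustrative defaults.

```python
import statistics

def zscore_signal(prices_a, prices_b, entry=2.0, exit=0.5):
    """Classic pairs signal: act when the spread's z-score is extreme.
    Hedge ratio is fixed at 1 for simplicity; real strategies estimate it
    (e.g. via cointegration regression)."""
    spread = [a - b for a, b in zip(prices_a, prices_b)]
    mu = statistics.mean(spread)
    sigma = statistics.stdev(spread)  # sample stdev; assumes sigma > 0
    z = (spread[-1] - mu) / sigma
    if z > entry:
        return "short A / long B"   # spread unusually wide: bet on reversion
    if z < -entry:
        return "long A / short B"   # spread unusually narrow
    if abs(z) < exit:
        return "close"              # spread has reverted toward its mean
    return "hold"
```

The high-level policy's job in this formulation is everything this function takes for granted: which two assets form a tradable pair in the first place.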
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a language-driven hierarchical reinforcement learning framework called Moira for pair trading. In this setup, large language models serve as both high-level policies for selecting trading pairs based on semantic reasoning and low-level policies for executing trades under partial observability. Policies are adapted exclusively through updates to prompts using textual feedback from trajectories and episodes, without any gradient-based fine-tuning. The separation of levels is argued to reduce non-stationarity, facilitating better credit assignment in the presence of delayed and ambiguous feedback. Experiments using real-world market data are said to demonstrate consistent performance improvements over traditional and LLM-based baselines.
Significance. Should the empirical results prove robust, this contribution could advance the application of LLMs to hierarchical decision-making problems in finance and beyond, particularly where gradient access is limited or undesirable. It highlights a prompt-based method for adapting hierarchical policies, potentially offering a more interpretable alternative to standard RL fine-tuning. The use of real market data adds practical relevance, though the absence of detailed validation for the credit assignment mechanism limits the immediate impact.
major comments (2)
- [Abstract] The statement that experiments 'show consistent improvements' lacks any quantitative details, such as specific performance metrics (e.g., Sharpe ratio, returns), baseline comparisons, number of assets or time periods tested, or statistical significance tests. Without these, it is impossible to evaluate whether the data supports the central claim of effectiveness.
- [Abstract] The assertion that the framework 'enables targeted adaptation under delayed feedback' by separating abstraction selection from execution relies on the unverified assumption that textual feedback can perform level-specific credit assignment. No mechanism is provided for how a textual summary distinguishes high-level semantic errors from low-level execution issues amid market noise and partial observability, which is critical to the hierarchical benefit claimed.
minor comments (1)
- The abstract is dense with technical terms; consider adding a brief overview of pair trading for broader accessibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications from the full paper and indicating where revisions will strengthen the presentation without altering the core claims or results.
Point-by-point responses
Referee: [Abstract] The statement that experiments 'show consistent improvements' lacks any quantitative details, such as specific performance metrics (e.g., Sharpe ratio, returns), baseline comparisons, number of assets or time periods tested, or statistical significance tests. Without these, it is impossible to evaluate whether the data supports the central claim of effectiveness.
Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately assess the strength of the empirical claims. The full manuscript's experimental section (Section 5) reports quantitative results on real market data, including comparisons against traditional RL baselines and other LLM-based methods, with metrics such as cumulative returns, Sharpe ratios, and win rates across multiple asset pairs and time periods. In the revised manuscript we will condense key quantitative highlights into the abstract (e.g., average return improvement and number of evaluated pairs/periods) while preserving the high-level summary style. revision: yes
Referee: [Abstract] The assertion that the framework 'enables targeted adaptation under delayed feedback' by separating abstraction selection from execution relies on the unverified assumption that textual feedback can perform level-specific credit assignment. No mechanism is provided for how a textual summary distinguishes high-level semantic errors from low-level execution issues amid market noise and partial observability, which is critical to the hierarchical benefit claimed.
Authors: The manuscript does provide a mechanism, described in Sections 3.2 and 4: high-level policies receive episode-level textual summaries that focus on pair-selection quality and semantic alignment with market regimes, while low-level policies receive trajectory-level feedback that isolates execution actions (entry/exit timing and sizing) within a fixed pair. Because the LLM reasons over these distinct textual signals, it can attribute performance degradation to either the abstraction choice or the execution without gradient-based credit assignment. We acknowledge that the abstract is too terse to convey this distinction and that explicit examples of feedback prompts would help readers verify the separation. We will therefore add a short illustrative example and a brief robustness discussion in the revised version. revision: partial
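The separation the authors describe amounts to two disjoint feedback channels, each scoped to one level's decisions. A minimal sketch of that routing, assuming hypothetical function names and feedback templates not taken from the paper:

```python
def route_feedback(episode_summary: str, trajectory_log: list) -> dict:
    """Split critique into level-specific channels: episode-level text goes
    only to the selector's prompt update, trajectory-level text only to the
    trader's, so each policy is revised against its own signal."""
    selector_feedback = (
        "Episode-level critique (pair selection only):\n" + episode_summary)
    trader_feedback = (
        "Trajectory-level critique (entry/exit timing and sizing only):\n"
        + "\n".join(trajectory_log))
    return {"selector": selector_feedback, "trader": trader_feedback}
```

The referee's objection can be restated in these terms: the routing guarantees that each policy only sees its own channel, but it does not by itself guarantee that the textual summaries placed in each channel correctly attribute blame between levels.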
Circularity Check
No significant circularity; derivation and evaluation are self-contained.
full rationale
The paper formulates pair trading as hierarchical RL with LLMs as policies updated exclusively via prompt-based textual feedback at trajectory and episode levels. It claims this separation reduces non-stationarity and enables adaptation without gradients. Evaluation relies on external real-world market data and comparisons to independent baselines. No equations, fitted parameters presented as predictions, self-citations as load-bearing premises, or ansatzes smuggled via prior work are present in the provided text. The chain from problem statement to empirical results does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained LLMs can serve as effective policies for both high-level and low-level decisions in hierarchical RL when adapted via prompts.