Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems

Junyi Yao; Zihao Zheng

arXiv: 2606.08285 · v1 · pith:IWGUFMR5new · submitted 2026-06-06 · 💻 cs.AI · cs.CE · q-fin.CP · q-fin.TR

Pith reviewed 2026-06-27 19:33 UTC · model grok-4.3

classification 💻 cs.AI · cs.CE · q-fin.CP · q-fin.TR

keywords: LLM trading, reproducibility, execution realism, agentic systems, financial trading, evaluation assumptions, reporting standards, transaction costs

The pith

LLM trading papers report architectures more clearly than execution assumptions needed for reproducibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a targeted review and audit of 30 primary studies on LLM-based trading systems. It examines how well they document data provenance, temporal splits, execution timing, turnover, transaction costs, and related evaluation choices. Architecture descriptions tend to be more complete than these execution details, leaving many reported results difficult to interpret economically or replicate. A small worked example with 10 equities shows that adding realistic frictions can substantially reduce apparent strategy performance. The authors conclude that progress requires standardized reporting on execution realism and comparability in addition to new agent designs.

Core claim

Across the audited sample, architecture reporting is generally clearer than the evaluation assumptions needed to judge whether a trading result is economically interpretable or reproducible. A coded evidence matrix covering 30 trade-relevant primary studies assesses point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release. The main conclusion is that the next useful step for LLM trading research is not only better agent design but also clearer reporting standards for execution realism, reproducibility, and evaluation comparability.

What carries the argument

A coded evidence matrix that systematically assesses point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release across the 30 studies.
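
As a rough illustration of what such a matrix could look like in code, the sketch below stores one code per study and dimension and averages them into per-dimension coverage. The dimension names follow the abstract; the three-level scale (`explicit`, `partial`, `absent`), the numeric scores, and the three studies are assumptions made up for this example, not the paper's actual coding scheme or data.

```python
# Audit dimensions as listed in the abstract.
DIMENSIONS = [
    "point_in_time_controls",
    "split_transparency",
    "held_out_evaluation",
    "cost_and_turnover",
    "execution_semantics",
    "universe_definition",
    "artifact_release",
]

# Assumed three-level scale; the paper's exact coding scheme may differ.
SCORES = {"explicit": 1.0, "partial": 0.5, "absent": 0.0}

# Hypothetical studies with invented codes, one code per dimension in order.
RAW_CODES = {
    "study_A": ["explicit", "partial", "absent", "absent", "partial", "explicit", "explicit"],
    "study_B": ["absent", "explicit", "explicit", "partial", "absent", "partial", "absent"],
    "study_C": ["partial", "partial", "absent", "absent", "absent", "explicit", "absent"],
}

matrix = {study: dict(zip(DIMENSIONS, codes)) for study, codes in RAW_CODES.items()}


def coverage(matrix):
    """Mean score per dimension across all coded studies (0 = never reported, 1 = always explicit)."""
    n = len(matrix)
    return {
        dim: sum(SCORES[codes[dim]] for codes in matrix.values()) / n
        for dim in DIMENSIONS
    }


cov = coverage(matrix)
# Print dimensions from least to most covered.
for dim, value in sorted(cov.items(), key=lambda kv: kv[1]):
    print(f"{dim:24s} {value:.2f}")
```

Sorting by coverage makes the least-reported dimensions stand out, which is the kind of aggregate view Figure 3 describes.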

If this is right

Trading performance numbers cannot be compared across studies without consistent disclosure of execution timing and costs.
Many published LLM trading results may shrink or disappear once realistic transaction costs and turnover are applied.
Improvements in agent architecture alone will not resolve the field's comparability problems.
Artifact release and universe definition details are frequently omitted, blocking exact replication.
A methodological scaffold with explicit friction modeling can demonstrate material changes in strategy outcomes.
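
The point about costs and turnover can be made concrete with a minimal sketch. It assumes a single-asset strategy, a proportional one-way cost quoted in basis points, and turnover measured as the absolute change in position; the return series and the 10 bps figure are invented for illustration and are not taken from the paper's worked example.

```python
def net_returns(gross_returns, positions, cost_bps):
    """Subtract proportional trading costs from per-period gross strategy returns.

    gross_returns[t]: strategy return in period t before costs.
    positions[t]:     position held during period t (e.g. -1.0 to 1.0).
    cost_bps:         assumed one-way cost in basis points per unit of turnover.
    """
    cost_rate = cost_bps / 10_000.0
    net = []
    prev_position = 0.0  # assume the strategy starts in cash
    for gross, position in zip(gross_returns, positions):
        turnover = abs(position - prev_position)
        net.append(gross - cost_rate * turnover)
        prev_position = position
    return net


def compound(returns):
    """Cumulative return from a sequence of per-period simple returns."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


# Invented numbers: a strategy that flips its position every period.
gross = [0.002, 0.001, 0.003, -0.001]
positions = [1.0, -1.0, 1.0, -1.0]

net = net_returns(gross, positions, cost_bps=10)
print("gross cumulative:", compound(gross))
print("net cumulative:  ", compound(net))
```

With these made-up inputs the high-turnover strategy is profitable before costs and loses money after them, which is why a result reported without its cost and turnover assumptions is hard to interpret.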

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adopting a minimal checklist for execution reporting could allow future meta-analyses to separate viable strategies from artifacts of optimistic assumptions.
The same gaps in evaluation transparency likely appear in other LLM applications to finance and may require field-wide standards.
Clearer execution reporting would make it easier to test whether LLM agents add value beyond simple momentum or mean-reversion rules.

Load-bearing premise

The selection of 30 primary studies and the coding criteria provide an unbiased and representative picture of the broader LLM trading literature.

What would settle it

An independent audit sampling a different collection of studies or applying altered coding criteria that finds evaluation assumptions reported at least as clearly as architectures.

Figures

Figures reproduced from arXiv: 2606.08285 by Junyi Yao, Zihao Zheng.

**Figure 2.** Equity-curve comparison for the four strategies in the 10-stock real-data case study. The x-axis uses trading-day index rather than calendar labels to ...

**Figure 3.** Aggregate execution-reproducibility coverage among 30 coded pri...

**Figure 4.** Study-level coding heatmap for the 30-study primary audit. Green denotes explicit reporting or centrality to the study design, yellow denotes partial ...

Original abstract

Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains difficult to compare because studies vary in data provenance, temporal split discipline, execution timing, turnover treatment, and transaction-cost modeling. This article presents a targeted topical review and reproducibility audit of execution realism in LLM-based trading research. A coded evidence matrix covering 30 trade-relevant primary studies is used to assess point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release. Across the audited sample, architecture reporting is generally clearer than the evaluation assumptions needed to judge whether a trading result is economically interpretable or reproducible. A 10-equity worked example is included only as a methodological scaffold to illustrate how explicit friction and timing choices can materially compress active-strategy results. The main conclusion is that the next useful step for LLM trading research is not only better agent design, but also clearer reporting standards for execution realism, reproducibility, and evaluation comparability.
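
The effect of timing choices mentioned in the abstract can be sketched in a few lines. The example assumes that `signals[t]` is a position decided at the end of period `t` and that `asset_returns[t]` is the return over period `t`; the `lag` parameter and all numbers are hypothetical and only illustrate the mechanism, not the paper's 10-equity setup.

```python
def strategy_returns(signals, asset_returns, lag):
    """Per-period strategy returns when a signal formed at the end of period t
    is first held during period t + lag.

    lag=0 lets the signal earn the return of the period it was formed in,
    which is only possible with look-ahead. lag=1 is the earliest tradable choice.
    """
    out = []
    for t, asset_return in enumerate(asset_returns):
        source = t - lag
        position = signals[source] if source >= 0 else 0.0  # flat until a signal exists
        out.append(position * asset_return)
    return out


# Invented returns, and a deliberately leaky signal that copies the sign
# of the same period's return.
asset_returns = [0.01, -0.02, 0.015, -0.005]
signals = [1.0 if r > 0 else -1.0 for r in asset_returns]

same_period = strategy_returns(signals, asset_returns, lag=0)
next_period = strategy_returns(signals, asset_returns, lag=1)

print("sum with lag 0:", sum(same_period))
print("sum with lag 1:", sum(next_period))
```

The same signal series looks strongly profitable under the optimistic timing and loses money once it is traded one period later, so a paper that does not state its execution timing leaves this ambiguity unresolved.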

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper flags that LLM trading studies report architectures more clearly than the execution details needed for reproducibility, but its own sample of 30 studies lacks any stated selection method.

The main takeaway is that, in the audited LLM trading papers, architecture descriptions tend to outpace reporting on point-in-time data, splits, costs, turnover, and execution timing. That gap makes it hard to tell whether claimed results would hold up in practice.

The paper runs a targeted review of 30 studies, codes them into an evidence matrix on those realism factors, and adds a short 10-equity example to show how realistic frictions shrink active returns. The example works as a simple scaffold rather than a full backtest. The central claim follows directly from the matrix: clearer execution-reporting standards are needed in addition to new agent designs.

The soft spot is the sample itself. The abstract calls the work a targeted topical review but supplies no search terms, databases, date cutoffs, or inclusion rules. Without that, the observed difference in reporting quality could reflect how the 30 papers were chosen rather than a field-wide pattern. The authors treat the matrix as illustrative, which is honest, but it still limits how far the conclusion travels.

This is useful for anyone building or refereeing LLM trading systems who wants a checklist of execution items to watch. It is not a theoretical advance and does not claim new trading results. The issue it raises is real enough that the paper deserves a serious referee even with the selection gap; reviewers can ask for the protocol details and decide how much weight to give the audit.

Referee Report

1 major / 2 minor

Summary. The paper conducts a targeted topical review and reproducibility audit of execution realism in LLM-based trading research. It codes an evidence matrix from 30 primary studies on point-in-time controls, split transparency, held-out evaluation, cost/turnover treatment, execution semantics, universe definition, and artifact release. The central finding is that architecture reporting is generally clearer than the evaluation assumptions required to assess economic interpretability or reproducibility. A 10-equity worked example is presented solely as a methodological scaffold to show how explicit friction and timing choices can compress active-strategy results. The conclusion calls for clearer reporting standards on execution realism rather than further focus on agent architecture alone.

Significance. If the audit accurately captures patterns in the sampled literature, the work usefully redirects attention in an emerging subfield from architectural novelty to foundational reproducibility and economic validity. The coded evidence matrix offers a reusable template for future audits, and the scaffold example concretely demonstrates the material impact of unstated assumptions. These elements provide practical value even if the sample's representativeness requires clarification.

major comments (1)

[Methods / Audit Design] (description of the 30-study sample): No search strategy, databases, date range, inclusion/exclusion criteria, or sampling justification is provided for the 'targeted topical review.' Because the central claim concerns a disparity observable 'across the audited sample' and is used to recommend field-wide changes in reporting standards, the absence of selection protocol details makes it impossible to judge whether the observed pattern is representative or an artifact of how the studies were chosen.

minor comments (2)

[Abstract and Conclusion] The abstract and conclusion could more explicitly qualify the scope as applying to the audited sample rather than the entire LLM trading literature, to avoid overgeneralization while the selection protocol remains unspecified.
[Worked Example] The 10-equity example is labeled a 'scaffold' and not a formal result, but its placement and level of detail could be clarified so readers do not mistake it for empirical evidence supporting the audit findings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in our audit design. We agree that additional details on sample selection will strengthen the manuscript and will revise accordingly, while clarifying that the review is targeted rather than systematic.

Referee: [Methods / Audit Design] (description of the 30-study sample): No search strategy, databases, date range, inclusion/exclusion criteria, or sampling justification is provided for the 'targeted topical review.' Because the central claim concerns a disparity observable 'across the audited sample' and is used to recommend field-wide changes in reporting standards, the absence of selection protocol details makes it impossible to judge whether the observed pattern is representative or an artifact of how the studies were chosen.

Authors: We acknowledge the validity of this observation. The manuscript describes the work as a 'targeted topical review' of 30 trade-relevant primary studies but does not detail the identification process. To address this, we will add a new subsection under Methods that specifies: (1) keyword-based searches on arXiv, SSRN, and Google Scholar using terms such as 'LLM-based trading', 'LLM agent trading', and 'large language model trading system' from 2023 onward; (2) inclusion of papers that propose and evaluate an LLM or agentic trading strategy with reported performance metrics; (3) exclusion of purely theoretical or non-trading LLM papers; and (4) a convenience element limited to readily accessible English-language preprints and conference papers known in the subfield at the time of writing. We will also explicitly state that the sample is not intended to be exhaustive or statistically representative of the entire literature, but rather illustrative of patterns in published LLM trading work. This revision will allow readers to better evaluate the scope while preserving the paper's focus on execution assumptions rather than claiming broad generalizability.

Revision: yes
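
The inclusion and exclusion rules proposed in this response could be applied mechanically along the following lines. This is only a sketch: the `Record` fields, the `include` function, and the candidate entries are hypothetical names and data introduced for illustration, and the convenience-sampling element of the proposed protocol is not something code can capture.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical screening record for one candidate paper."""
    title: str
    year: int
    language: str
    proposes_llm_trading_strategy: bool
    reports_performance_metrics: bool


def include(record, start_year=2023):
    """Apply the stated rules: 2023 onward, English-language, proposes and
    evaluates an LLM or agentic trading strategy with reported metrics."""
    return (
        record.year >= start_year
        and record.language == "en"
        and record.proposes_llm_trading_strategy
        and record.reports_performance_metrics
    )


# Invented candidates, one per screening outcome.
candidates = [
    Record("Hypothetical agent backtest", 2024, "en", True, True),
    Record("Hypothetical theory note", 2024, "en", False, False),
    Record("Hypothetical early system", 2021, "en", True, True),
    Record("Hypothetical untranslated paper", 2025, "de", True, True),
]

selected = [r.title for r in candidates if include(r)]
print(selected)
```

Recording the screening decision per paper in this form would also let readers see which rule excluded each candidate.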

Circularity Check

0 steps flagged

No circularity: observational audit with no derivations or fitted predictions

The paper is a targeted topical review and reproducibility audit of execution assumptions in 30 external LLM trading studies. It codes an evidence matrix on reporting practices and draws an observational conclusion that architecture reporting exceeds evaluation-assumption transparency in the sample. No mathematical derivations, first-principles results, parameter fitting, or predictions appear; the 10-equity example is explicitly labeled an illustration only. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing support. The central claim rests on the external audit rather than reducing to the paper's own inputs by construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the 30-study sample and the objectivity of the evidence matrix coding; no free parameters, new axioms, or invented entities are introduced.

axioms (1)

[standard math] Standard practices for literature coding and evidence synthesis apply to the construction of the 30-study matrix.
The audit uses a coded evidence matrix whose validity depends on conventional review methodology.

pith-pipeline@v0.9.1-grok · 5713 in / 1197 out tokens · 21271 ms · 2026-06-27T19:33:45.551764+00:00 · methodology


[1] [1]

Large language model agent in financial trading: A survey,

H. Ding, Y . Li, J. Wang, H. Chen, D. Guo, and Y . Zhang, “Large language model agent in financial trading: A survey,” arXiv preprint arXiv:2408.06361, 2024. [Online]. Available: https: //arxiv.org/abs/2408.06361

arXiv 2024

[2] [2]

TradingGPT: Multi- agent system with layered memory and distinct characters for enhanced financial trading performance,

Y . Li, Y . Yu, H. Li, Z. Chen, and K. Khashanah, “TradingGPT: Multi- agent system with layered memory and distinct characters for enhanced financial trading performance,”arXiv preprint arXiv:2309.03736, 2023. [Online]. Available: https://arxiv.org/abs/2309.03736

arXiv 2023

[3] [4]

Available: https://arxiv.org/abs/2407.18957

[Online]. Available: https://arxiv.org/abs/2407.18957

Pith/arXiv arXiv

[4] [6]

Available: https://arxiv.org/abs/2412.20138

[Online]. Available: https://arxiv.org/abs/2412.20138

arXiv

[5] [7]

FinMem: A performance- enhanced LLM trading agent with layered memory and character design,

Y . Yu, H. Li, Z. Chen, Y . Jiang, Y . Li, D. Zhang, R. Liu, J. W. Suchow, and K. Khashanah, “FinMem: A performance- enhanced LLM trading agent with layered memory and character design,”arXiv preprint arXiv:2311.13743, 2023. [Online]. Available: https://arxiv.org/abs/2311.13743

arXiv 2023

[6] [8]

FinAgent: A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist,

W. Zhang, L. Zhao, H. Xia, S. Sun, J. Sun, M. Qin, X. Li, Y . Zhao, Y . Zhao, X. Cai, L. Zheng, X. Wang, and B. An, “FinAgent: A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist,”arXiv preprint arXiv:2402.18485, 2024. [Online]. Available: https://arxiv.org/abs/2402.18485

arXiv 2024

[7] [9]

CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading,

Y . Li, B. Luo, Q. Wang, N. Chen, X. Liu, and B. He, “CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading,”arXiv preprint arXiv:2407.09546, 2024. [Online]. Available: https://arxiv.org/abs/2407.09546

arXiv 2024

[8] [10]

Agentic trading: When LLM agents meet financial markets,

Y . Xia, P. You, T. Wang, F. Liu, H. Qi, X. Wu, and S. Zhang, “Agentic trading: When LLM agents meet financial markets,” arXiv preprint arXiv:2605.19337, 2026. [Online]. Available: https: //arxiv.org/abs/2605.19337

Pith/arXiv arXiv 2026

[9] [11]

BloombergGPT: A large language model for finance,

S. Wu, O. Irsoy, S. Lu, V . Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “BloombergGPT: A large language model for finance,”arXiv preprint arXiv:2303.17564, 2023. [Online]. Available: https://arxiv.org/abs/2303.17564

Pith/arXiv arXiv 2023

[10] [12]

FinGPT: Open-source financial large language models,

H. Yang, X.-Y . Liu, and C. D. Wang, “FinGPT: Open-source financial large language models,”arXiv preprint arXiv:2306.06031, 2023. [Online]. Available: https://arxiv.org/abs/2306.06031

arXiv 2023

[11] [13]

Large language models in finance: A survey,

Y . Li, S. Wang, H. Ding, and H. Chen, “Large language models in finance: A survey,”arXiv preprint arXiv:2311.10723, 2023. [Online]. Available: https://arxiv.org/abs/2311.10723

arXiv 2023

[12] [14]

A survey of large language models in finance (FinLLMs),

J. Lee, N. Stevens, S. C. Han, and M. Song, “A survey of large language models in finance (FinLLMs),”arXiv preprint arXiv:2402.02315, 2024. [Online]. Available: https://arxiv.org/abs/2402.02315

arXiv 2024

[13] [15]

FinanceBench: A new benchmark for financial question answering,

P. Islam, A. Kannappan, D. Kiela, R. Qian, N. Scherrer, and B. Vidgen, “FinanceBench: A new benchmark for financial question answering,”arXiv preprint arXiv:2311.11944, 2023. [Online]. Available: https://arxiv.org/abs/2311.11944

Pith/arXiv arXiv 2023

[14] [16]

FinBen: A holistic financial benchmark for large language models,

Q. Xie, W. Han, Z. Chen, R. Xiang, X. Zhang, Y. He, M. Xiao, D. Li, Y. Dai, D. Feng, Y. Xu, H. Kang, Z. Kuang, C. Yuan, K. Yang, Z. Luo, T. Zhang, Z. Liu, G. Xiong, Z. Deng, Y. Jiang, Z. Yao, H. Li, Y. Yu, G. Hu, J. Huang, X.-Y. Liu, A. Lopez-Lira, B. Wang, Y. Lai, H. Wang, M. Peng, S. Ananiadou, and J. Huang, “FinBen: A holistic financial benchma...

arXiv 2024

[15] [17]

FinRobot: An open-source AI agent platform for financial applications using large language models,

H. Yang, B. Zhang, N. Wang, C. Guo, X. Zhang, L. Lin, J. Wang, T. Zhou, M. Guan, R. Zhang, and C. D. Wang, “FinRobot: An open-source AI agent platform for financial applications using large language models,” arXiv preprint arXiv:2405.14767, 2024. [Online]. Available: https://arxiv.org/abs/2405.14767

arXiv 2024

[16] [18]

Can ChatGPT forecast stock price movements? return predictability and large language models,

A. Lopez-Lira and Y. Tang, “Can ChatGPT forecast stock price movements? return predictability and large language models,” arXiv preprint arXiv:2304.07619, 2023. [Online]. Available: https://arxiv.org/abs/2304.07619

arXiv 2023

[17] [19]

Sentiment trading with large language models,

K. Kirtac and G. Germano, “Sentiment trading with large language models,” arXiv preprint arXiv:2412.19245, 2024. [Online]. Available: https://arxiv.org/abs/2412.19245

arXiv 2024

[18] [20]

AlphaAgents: Large language model based multi-agents for equity portfolio constructions,

T. Zhao, J. Lyu, S. Jones, H. Garber, S. Pasquali, and D. Mehta, “AlphaAgents: Large language model based multi-agents for equity portfolio constructions,” arXiv preprint arXiv:2508.11152, 2025. [Online]. Available: https://arxiv.org/abs/2508.11152

arXiv 2025

[19] [21]

ContestTrade: A multi-agent trading system based on internal contest mechanism,

L. Zhao, R. Sun, Z. Jiang, B. Yang, Y. Bai, M. Chen, X. Wang, J. Li, and Z. Bai, “ContestTrade: A multi-agent trading system based on internal contest mechanism,” arXiv preprint arXiv:2508.00554, 2025. [Online]. Available: https://arxiv.org/abs/2508.00554

arXiv 2025

[20] [22]

Toward expert investment teams: A multi-agent LLM system with fine-grained trading tasks,

K. Miyazaki, T. Kawahara, S. Roberts, and S. Zohren, “Toward expert investment teams: A multi-agent LLM system with fine-grained trading tasks,” arXiv preprint arXiv:2602.23330, 2026. [Online]. Available: https://arxiv.org/abs/2602.23330

arXiv 2026

[21] [23]

AlphaCrafter: A full-stack multi-agent framework for cross-sectional quantitative trading,

Y. Yuan, J. Sheng, S. Zeng, J. Wang, and J. Liu, “AlphaCrafter: A full-stack multi-agent framework for cross-sectional quantitative trading,” arXiv preprint arXiv:2605.05580, 2026. [Online]. Available: https://arxiv.org/abs/2605.05580

Pith/arXiv arXiv 2026

[22] [24]

MM-DREX: Multimodal-driven dynamic routing of LLM experts for financial trading,

Y. Chen, Y. Jiang, Z. Ma, Y. Cao, J. Keung, K. Kuang, L. Gan, Y. Wu, and F. Wu, “MM-DREX: Multimodal-driven dynamic routing of LLM experts for financial trading,” arXiv preprint arXiv:2509.05080, 2025. [Online]. Available: https://arxiv.org/abs/2509.05080

arXiv 2025

[23] [25]

QRAFTI: An agentic framework for empirical research in quantitative finance,

T. Lim, K. Muthuraman, and M. Sury, “QRAFTI: An agentic framework for empirical research in quantitative finance,” arXiv preprint arXiv:2604.18500, 2026. [Online]. Available: https://arxiv.org/abs/2604.18500

Pith/arXiv arXiv 2026

[24] [26]

Alpha-GPT: Human-AI interactive alpha mining for quantitative investment,

S. Wang, H. Yuan, L. Zhou, L. M. Ni, H.-Y. Shum, and J. Guo, “Alpha-GPT: Human-AI interactive alpha mining for quantitative investment,” arXiv preprint arXiv:2308.00016, 2023. [Online]. Available: https://arxiv.org/abs/2308.00016

arXiv 2023

[25] [27]

QuantAgent: Seeking holy grail in trading by self-improving large language model,

S. Wang, H. Yuan, L. M. Ni, and J. Guo, “QuantAgent: Seeking holy grail in trading by self-improving large language model,” arXiv preprint arXiv:2402.03755, 2024. [Online]. Available: https://arxiv.org/abs/2402.03755

arXiv 2024

[26] [28]

StockGPT: A GenAI model for stock prediction and trading,

D. Mai, “StockGPT: A GenAI model for stock prediction and trading,” arXiv preprint arXiv:2404.05101, 2024. [Online]. Available: https://arxiv.org/abs/2404.05101

arXiv 2024

[27] [30]

[Online]. Available: https://arxiv.org/abs/2502.16789

arXiv

[28] [32]

[Online]. Available: https://arxiv.org/abs/2502.07393

arXiv

[29] [33]

Hubble: An LLM-driven agentic framework for safe, diverse, and reproducible alpha factor discovery,

R. Shi, S. Yan, Y. Cai, and C. Lv, “Hubble: An LLM-driven agentic framework for safe, diverse, and reproducible alpha factor discovery,” arXiv preprint arXiv:2604.09601, 2026. [Online]. Available: https://arxiv.org/abs/2604.09601

Pith/arXiv arXiv 2026

[30] [34]

INVESTORBENCH: A benchmark for financial decision-making tasks with LLM-based agent,

H. Li, Y. Cao, Y. Yu, S. R. Javaji, Z. Deng, Y. He, Y. Jiang, Z. Zhu, K. Subbalakshmi, G. Xiong, J. Huang, L. Qian, X. Peng, Q. Xie, and J. W. Suchow, “INVESTORBENCH: A benchmark for financial decision-making tasks with LLM-based agent,” arXiv preprint arXiv:2412.18174, 2024. [Online]. Available: https://arxiv.org/abs/2412.18174

arXiv 2024

[31] [35]

AI-Trader: Benchmarking autonomous agents in real-time financial markets,

T. Fan, Y. Yang, Y. Jiang, Y. Zhang, Y. Chen, and C. Huang, “AI-Trader: Benchmarking autonomous agents in real-time financial markets,” arXiv preprint arXiv:2512.10971, 2025. [Online]. Available: https://arxiv.org/abs/2512.10971

arXiv 2025

[32] [36]

PortBench: A correlation-aware, full-pipeline benchmark for LLM-driven portfolio management,

Y. Zhao, S. Chen, and N. Su, “PortBench: A correlation-aware, full-pipeline benchmark for LLM-driven portfolio management,” arXiv preprint arXiv:2605.27887, 2026. [Online]. Available: https://arxiv.org/abs/2605.27887

Pith/arXiv arXiv 2026

[33] [37]

Look-Ahead-Bench: A standardized benchmark of look-ahead bias in point-in-time LLMs for finance,

M. Benhenda, “Look-Ahead-Bench: A standardized benchmark of look-ahead bias in point-in-time LLMs for finance,” arXiv preprint arXiv:2601.13770, 2026. [Online]. Available: https://arxiv.org/abs/2601.13770

arXiv 2026

[34] [38]

Can large language models trade? testing financial theories with LLM agents in market simulations,

A. Lopez-Lira, “Can large language models trade? testing financial theories with LLM agents in market simulations,” arXiv preprint arXiv:2504.10789, 2025. [Online]. Available: https://arxiv.org/abs/2504.10789

arXiv 2025

[35] [40]

[Online]. Available: https://arxiv.org/abs/2601.13082

arXiv

[36] [41]

Ranking abuse via strategic pairwise data perturbations,

J. Yao, Z. Zheng, and J. Long, “Ranking abuse via strategic pairwise data perturbations,” arXiv preprint arXiv:2604.17805, 2026. [Online]. Available: https://arxiv.org/abs/2604.17805

Pith/arXiv arXiv 2026

[37] [42]

The PRISMA 2020 statement: An updated guideline for reporting systematic reviews,

M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrow, L. Shamseer, J. M. Tetzlaff, E. A. Akl, S. E. Brennan, R. Chou, J. Glanville, J. M. Grimshaw, A. Hróbjartsson, M. M. Lalu, T. Li, E. W. Loder, E. Mayo-Wilson, S. McDonald, L. A. McGuinness, L. A. Stewart, J. Thomas, A. C. Tricco, V. A. Welch, P. Whiting, and BEYOND AGE...

2020

[38] [43]

Chain-of-alpha: Unleashing the power of large language models for alpha mining in quantitative trading,

L. Cao, “Chain-of-alpha: Unleashing the power of large language models for alpha mining in quantitative trading,” arXiv preprint arXiv:2508.06312, 2025, withdrawn. [Online]. Available: https://arxiv.org/abs/2508.06312

arXiv 2025

[39] [44]

Open-FinLLMs: Open multimodal large language models for financial applications,

J. Huang, M. Xiao, D. Li, Z. Jiang, Y. Yang, Y. Zhang, L. Qian, Y. Wang, X. Peng, Y. Ren et al., “Open-FinLLMs: Open multimodal large language models for financial applications,” arXiv preprint arXiv:2408.11878, 2024. [Online]. Available: https://arxiv.org/abs/2408.11878

arXiv 2024

[40] [45]

Alpha-GPT 2.0: Human-in-the-loop AI for quantitative investment,

H. Yuan, S. Wang, and J. Guo, “Alpha-GPT 2.0: Human-in-the-loop AI for quantitative investment,” arXiv preprint arXiv:2402.09746, 2024. [Online]. Available: https://arxiv.org/abs/2402.09746

arXiv 2024

[41] [46]

Learning to generate explainable stock predictions using self-reflective large language models,

K. J. L. Koa, Y. Ma, R. Ng, and T.-S. Chua, “Learning to generate explainable stock predictions using self-reflective large language models,” arXiv preprint arXiv:2402.03659, 2024. [Online]. Available: https://arxiv.org/abs/2402.03659

arXiv 2024

[42] [48]

[Online]. Available: https://arxiv.org/abs/2403.12285

arXiv

[43] [49]

FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making,

Y. Yu, Z. Yao, H. Li, Z. Deng, Y. Cao, Z. Chen, J. W. Suchow, R. Liu, Z. Cui, Z. Xu, D. Zhang, K. Subbalakshmi, G. Xiong, Y. He, J. Huang, D. Li, and Q. Xie, “FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making,” arXiv preprint arXiv:2407.06567, 2024. [Online]. Available: https://arxiv...

arXiv 2024

[44] [51]

[Online]. Available: https://arxiv.org/abs/2508.02366

arXiv

[45] [52]

ChatGPT,

OpenAI, “ChatGPT,” https://chatgpt.com/, 2026, AI system used for manuscript drafting assistance

2026

[46] [53]

——, “Codex,” https://openai.com/codex/, 2026, AI coding assistant used for LaTeX editing and build verification

2026

[47] [54]

Submission guidelines for authors,

IEEE Access, “Submission guidelines for authors,” https://ieeeaccess.ieee.org/guide-for-authors/submission-guidelines/, 2026, accessed: 2026-05-30

2026

[48] [55]

Preparing your article,

——, “Preparing your article,” https://ieeeaccess.ieee.org/authors/preparing-your-article/, 2026, accessed: 2026-05-30

2026