pith. sign in

arxiv: 2606.27570 · v1 · pith:WHPWAJSPnew · submitted 2026-06-25 · 💻 cs.LO

Auditing AI Investment Recommendations as Executable Actions

Pith reviewed 2026-06-29 00:28 UTC · model grok-4.3

classification 💻 cs.LO
keywords AI auditinginvestment recommendationsexecutable actionsvaliditystabilityagreementbaseline verifierfinancial constraints
0
0 comments X

The pith

AI investment recommendations must be audited for executability before evaluating their returns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard ways of judging AI investment advice by return or agreement with a reference can accept recommendations that cannot be executed due to portfolio rules and fees. It introduces an auditing protocol using a replayable baseline to measure three distinct properties: validity under constraints, stability over runs, and agreement with the baseline. These properties do not align, as the advisor with the highest agreement score of 0.94 is valid in only 0.58 of cases. Frontier models often fail due to arithmetic errors in orders rather than poor investment logic, but providing deterministic fee calculations improves their validity. This approach prioritizes executable actions over predictive performance.

Core claim

An AI-generated recommendation should first be audited as an executable financial action using a deterministic baseline verifier, and only then judged on return. The protocol scores on validity under portfolio and fee constraints, stability across repeated runs, and agreement with the baseline. These separate cleanly, with agreement being misleading: the top-agreeing control at 0.94 is admissible only 0.58 of the time. Models fail on arithmetic, fixed by supplying fee arithmetic, with no alpha claim made for the verifier.

What carries the argument

The three-property auditing protocol (validity, stability, agreement) applied to a deterministic replayable baseline verifier derived from the fee schedule.

If this is right

  • Agreement metrics alone certify invalid actions in 42 percent of cases.
  • Frontier models achieve near-perfect validity when fee arithmetic is supplied deterministically.
  • Validity checks must precede return or agreement judgments for AI financial advice.
  • Failures often stem from order arithmetic rather than investment judgment.
  • The baseline provides a transparent, replayable standard without claiming superior returns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar auditing protocols could apply to AI recommendations in other constrained domains such as logistics or medical dosing.
  • Freezing the baseline inputs allows for reproducible benchmarks across different AI systems.
  • AI models may benefit from integrated constraint solvers to avoid arithmetic failures in recommendations.
  • Expanding the scenario bank beyond 120 cases could identify systematic failure modes in current models.

Load-bearing premise

The three properties of validity, stability, and agreement can be measured separately without one determining the others, and the baseline's guardrails derive directly from the fee schedule with decisions replayable from frozen inputs.

What would settle it

Finding a recommendation that scores high on agreement and validity but cannot actually be executed in a live trading environment due to unaccounted market conditions, or a case where the baseline rejects an action that proves executable.

Figures

Figures reproduced from arXiv: 2606.27570 by \'Agney Lopes Roth Ferraz, Sidnei Barbieri, Wellington Vargas.

Figure 1
Figure 1. Figure 1: The audit boundary turns advisor text into executable orders and typed violations before any return is computed. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The archetype controls separate validity, agree [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Naive splitting pays recurring fee premiums; the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Each factor reshapes the ranking differently; equal [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Rolling one-year excess Sharpe often favors equal [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

AI systems increasingly produce investment recommendations, yet the usual evaluations ask the wrong question. Realized return is noisy and easy to overfit, and agreement with a reference portfolio can reward advice that cannot be executed. We argue that an AI-generated recommendation should first be audited as an executable financial action, and only then judged on return. We make this concrete with a deterministic, replayable baseline and a protocol that scores any advisor on three properties a single number conflates: validity under portfolio and fee constraints, stability across repeated runs, and agreement with the baseline. These properties separate cleanly, and agreement is the most misleading in isolation: across a 120-scenario bank, the control that agrees most with the baseline (0.94) is admissible in only 0.58 of its runs, so agreement certifies an invalid action in 42% of them. On an adversarial set, two frontier models are admissible in barely half of their bare-prompt runs and fail on order arithmetic, not judgment; supplying the fee arithmetic deterministically lifts both to near-perfect validity. We make no alpha claim: the baseline is a transparent verifier whose guardrails follow from the fee schedule and whose decisions replay from frozen inputs, and every figure and table regenerates offline from the artifact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper argues that evaluations of AI investment recommendations should prioritize auditing them as executable financial actions (under portfolio and fee constraints) before assessing returns or agreement with references. It introduces a deterministic, replayable baseline verifier and a scoring protocol for three separable properties—validity, stability across runs, and baseline agreement—supported by a 120-scenario control example (0.94 agreement but only 0.58 admissibility) and results on frontier models where bare-prompt runs fail on order arithmetic (admissible in ~half the cases) but improve to near-perfect validity when fee arithmetic is supplied deterministically. The work makes no alpha claim and supplies a transparent, parameter-free artifact for regenerating all figures and tables offline.

Significance. If the separation of properties and the baseline's replayability hold as described, the framework offers a concrete, falsifiable method for assessing executability of AI financial advice that avoids conflating agreement with validity. The explicit provision of a deterministic artifact and the absence of fitted parameters or invented entities strengthen verifiability and reproducibility, which are positive contributions in this area.

minor comments (2)
  1. The abstract and introduction would benefit from explicit section references or a table summarizing the three properties (validity, stability, agreement) with their definitions and how each is computed from the baseline.
  2. Clarify in the methods how the 120-scenario bank is constructed and whether the adversarial set is disjoint, to allow readers to assess the generality of the 0.94/0.58 separation result.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive and positive review. The assessment correctly identifies the core contribution: a separable, replayable protocol for auditing executability before returns or agreement metrics. We appreciate the emphasis on verifiability and the absence of fitted parameters. No major comments requiring substantive changes were raised, and we will address any minor editorial points in the revised version.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines a deterministic baseline verifier whose rules derive directly from the fee schedule and replay from frozen inputs, with no fitted parameters, self-referential definitions, or load-bearing self-citations. The three properties (validity, stability, agreement) are shown to separate via an explicit 120-scenario control example where high agreement (0.94) yields low admissibility (0.58), without any reduction of a prediction or claim to its own inputs by construction. The protocol is presented as an independent auditing method rather than a derived result that collapses into its premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields no visible free parameters or invented entities; the central assumption is that executability under constraints is the primary filter and that the three scored properties are independent.

axioms (1)
  • domain assumption Validity, stability, and agreement separate cleanly as distinct measurable properties.
    Stated directly in the abstract as a result of applying the protocol.

pith-pipeline@v0.9.1-grok · 5754 in / 1266 out tokens · 30931 ms · 2026-06-29T00:28:26.902179+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 11 canonical work pages

  1. [1]

    Alamdari, Toryn Q

    Parand A. Alamdari, Toryn Q. Klassen, and Sheila A. McIlraith. 2026. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems. arXiv:2605.16198 [cs.AI] https://arxiv.org/abs/2605.16198

  2. [2]

    Asness, Andrea Frazzini, and Lasse Heje Pedersen

    Clifford S. Asness, Andrea Frazzini, and Lasse Heje Pedersen. 2019. Quality Mi- nus Junk.Review of Accounting Studies24, 1 (2019), 34–112

  3. [3]

    Bailey, Jonathan M

    David H. Bailey, Jonathan M. Borwein, Marcos López de Prado, and Qiji Jim Zhu

  4. [4]

    Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance.Notices of the American Mathemati- cal Society61, 5 (2014), 458–471

  5. [5]

    Divij Chawla, Ashita Bhutada, Do Duc Anh, Abhinav Raghunathan, Vinod SP, Cathy Guo, Dar Win Liew, Prannaya Gupta, Rishabh Bhardwaj, Rajat Bhardwaj, and Soujanya Poria. 2025. Evaluating AI for Finance: Is AI Credible at Assessing Investment Risk? arXiv:2505.18953

  6. [6]

    Yanxu Chen, Zijun Yao, Yantao Liu, Amy Xin, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li. 2025. StockBench: Can LLM Agents Trade Stocks Profitably in Real- World Markets? arXiv:2510.02209 [cs.LG] doi:10.48550/arXiv.2510.02209

  7. [7]

    Constantinides

    George M. Constantinides. 1986. Capital Market Equilibrium with Transaction Costs.Journal of Political Economy94, 4 (1986), 842–862

  8. [8]

    Francesco D’Acunto, Nagpurnanand Prabhala, and Alberto G. Rossi. 2019. The Promises and Pitfalls of Robo-Advising.The Review of Financial Studies32, 5 (2019), 1983–2020. doi:10.1093/rfs/hhz014

  9. [9]

    M. H. A. Davis and A. R. Norman. 1990. Portfolio Selection with Transaction Costs.Mathematics of Operations Research15, 4 (1990), 676–713

  10. [10]

    Tomás de la Rosa. 2023. Planning for the Efficient Updating of Mutual Fund Portfolios. arXiv:2311.16204

  11. [11]

    Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. 2009. Optimal Versus Naive Diversification: How Inefficient Is the 1/N Portfolio Strategy?The Review of Financial Studies22, 5 (2009), 1915–1953

  12. [12]

    Soufiane El Amine El Alami, Abderazzak Mouiha, Abdelatif Hafid, and Ahmed El Hilali Alaoui. 2025. Machine Learning and Deep Learning in Computational Finance: A Systematic Review. arXiv:2511.21588

  13. [13]

    Nicolae Gârleanu and Lasse Heje Pedersen. 2013. Dynamic Trading with Pre- dictable Returns and Transaction Costs.The Journal of Finance68, 6 (2013), 2309–2340. doi:10.1111/jofi.12080

  14. [14]

    Shihao Gu, Bryan Kelly, and Dacheng Xiu. 2020. Empirical Asset Pricing via Machine Learning.The Review of Financial Studies33, 5 (2020), 2223–2273. doi:10. 1093/rfs/hhaa009

  15. [15]

    Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, and Chen Zhao. 2025. FinTrust: A Comprehensive Benchmark of Trustworthiness Evalua- tion in Finance Domain. InProceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing. Association for Computational Linguistics, Suzhou, China, 10099–10128. doi:10.18653/v1/2025.e...

  16. [16]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation.Comput. Surveys55, 12, Article 248 (2023), 38 pages. doi:10.1145/3571730

  17. [17]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Re- solve Real-World GitHub Issues?. InThe Twelfth International Conference on Learning Representations (ICLR). OpenReview.net, Vienna, Austria, 52 pages

  18. [18]

    Sridhar Kanuri and Robert W. McLeod. 2016. Sustainable Competitive Advan- tage and Stock Performance: The Case for Wide Moat Stocks.Applied Economics 48, 52 (2016), 5117–5127

  19. [19]

    Hoyoung Lee, Junhyuk Seo, Suhwan Park, Junhyeong Lee, Wonbin Ahn, Chanyeol Choi, Alejandro Lopez-Lira, and Yongjae Lee. 2025. Your AI, Not Your View: The Bias of LLMs in Investment Analysis. InProceedings of the 6th ACM International Conference on AI in Finance. Association for Computing Machinery, New York, NY, USA, 13 pages. doi:10.1145/3768292.3770375

  20. [20]

    Weixian Waylon Li, Hyeonjun Kim, Mihai Cucuringu, and Tiejun Ma. 2026. Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?. InProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1. Association for Computing Machinery, New York, NY, USA, 2711–2722. doi:10.1145/3770854.3785702

  21. [21]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew A. Hud- son, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hon...

  22. [22]

    Hong Liu. 2004. Optimal Consumption and Investment with Transaction Costs and Multiple Risky Assets.The Journal of Finance59, 1 (2004), 289–338. doi:10. 1111/j.1540-6261.2004.00634.x

  23. [23]

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, et al. 2024. AgentBench: Evaluating LLMs as Agents. InThe Twelfth International Conference on Learning Representa- tions (ICLR). OpenReview.net, Vienna, Austria, 58 pages

  24. [24]

    Miguel Sousa Lobo, Maryam Fazel, and Stephen Boyd. 2007. Portfolio Optimiza- tion with Linear and Fixed Transaction Costs.Annals of Operations Research152, 1 (2007), 341–365

  25. [25]

    Lesly Miculicich, Mihir Parmar, Hamid Palangi, Krishnamurthy Dj Dvijotham, Mirko Montanari, Tomas Pfister, and Long T. Le. 2025. VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation. arXiv:2510.05156. Google Cloud AI Research and Google DeepMind

  26. [26]

    Robert Novy-Marx. 2013. The Other Side of Value: The Gross Profitability Pre- mium.Journal of Financial Economics108, 1 (2013), 1–28

  27. [27]

    Daesan Oh, Taehwan Kim, Junkyu Jang, and Sung-Hyuk Park. 2025. Democratiz- ing Alpha: LLM-Driven Portfolio Construction for Retail Investors Using Public Financial Media. NeurIPS 2025 Workshop: Generative AI in Finance

  28. [28]

    Lingfei Qian, Xueqing Peng, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yu- peng Cao, Yangyang Yu, Guojun Xiong, Peng Lu, Yan Wang, Vincent Jim Zhang, Huan He, Jian-Yun Nie, Alejandro Lopez-Lira, Jimin Huang, and Sophia Ana- niadou. 2026. When Agents Trade: Live Multi-Market Trading Arena for LLM Agents. InProceedings of the ACM Web Conference 2026. Assoc...

  29. [29]

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. InProceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4902–4912. doi:10.18653/v1/ 2020.acl-main.442

  30. [30]

    Fernando Spadea and Oshani Seneviratne. 2025. Aligning Language Models with Investor and Market Behavior for Financial Recommendations. InProceedings of the 6th ACM International Conference on AI in Finance. Association for Comput- ing Machinery, New York, NY, USA, 9 pages. doi:10.1145/3768292.3770399

  31. [31]

    Poskitt, and Jun Sun

    Haoyu Wang, Christopher M. Poskitt, and Jun Sun. 2025. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents. arXiv:2503.18666 [cs.AI] https://arxiv.org/abs/2503.18666

  32. [32]

    Poskitt, Jiali Wei, and Jun Sun

    Haoyu Wang, Christopher M. Poskitt, Jiali Wei, and Jun Sun. 2026. ProbGuard: Probabilistic Runtime Monitoring for LLM Agent Safety. arXiv:2508.00500 [cs.AI] https://arxiv.org/abs/2508.00500

  33. [33]

    Qianqian Xie et al. 2024. FinBen: A Holistic Financial Benchmark for Large Lan- guage Models. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track. Curran Associates, Inc., Vancouver, BC, Canada, 28 pages. Auditing AI Investment Recommendations as Executable Actions Auditing AI Investment Recommendations as ...

  34. [34]

    Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023. PIXIU: A Comprehensive Benchmark, In- struction Dataset and Large Language Model for Finance. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track. Curran Associates, Inc., New Orleans, LA, USA, 16 pages

  35. [35]

    Rongju Zhang, Nicolas Langrené, Yu Tian, Zili Zhu, Fima Klebaner, and Kais Hamza. 2019. Dynamic Portfolio Optimization with Liquidity Cost and Mar- ket Impact: A Simulation-and-Regression Approach.Quantitative Finance19, 3 (2019), 519–532

  36. [36]

    Yuhan Zhi, Xiaoyu Zhang, Longtian Wang, Shumin Jiang, Shiqing Ma, Xiaohong Guan, and Chao Shen. 2025. Exposing Product Bias in LLM Investment Recom- mendation. arXiv:2503.08750 [cs.CL] doi:10.48550/arXiv.2503.08750