Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning
Pith reviewed 2026-06-28 09:59 UTC · model grok-4.3
The pith
Frontier AI agents score below 16 percent on a benchmark of 102 real hedge fund analyst tasks graded against expert reasoning traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hedge-Bench 1.0 consists of 102 actual on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16 percent on the benchmark.
What carries the argument
Hedge-Bench 1.0, a benchmark of 102 tasks drawn from hedge fund analyst work and supplied with explicit expert reasoning traces that permit deterministic grading.
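The idea of grading against expert steps can be sketched in a few lines. The source does not describe the harness's actual algorithm, so everything below is an assumption made for illustration: the `ExpertStep` record, the `normalize` helper, the exact-match rule, and the example step texts are all hypothetical. The sketch only shows why such a scheme is deterministic: the same agent output always yields the same score, with no model acting as judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertStep:
    # Hypothetical record: one verified step from an expert trace.
    step_id: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def grade_task(expert_steps, agent_steps):
    """Return the fraction of expert steps the agent reproduced.

    agent_steps maps step_id -> the agent's answer for that step.
    Matching is exact after normalization (an assumed rule, not the paper's).
    """
    if not expert_steps:
        raise ValueError("a task needs at least one expert step")
    matched = sum(
        1
        for step in expert_steps
        if step.step_id in agent_steps
        and normalize(agent_steps[step.step_id]) == normalize(step.expected)
    )
    return matched / len(expert_steps)


# Hypothetical example: the agent reproduces one of three expert steps.
expert = [
    ExpertStep("1", "Gross margin fell"),
    ExpertStep("2", "FX headwind"),
    ExpertStep("3", "Guide cut"),
]
agent = {"1": "gross  margin FELL", "2": "pricing pressure"}
score = grade_task(expert, agent)
```

Exact matching is the simplest rule that stays deterministic. A real harness may well use a looser comparison, such as numeric tolerances or sets of accepted answers, and that choice is where grading edge cases would arise.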
If this is right
- Current agents cannot yet replicate the traceable reasoning steps used in professional financial analysis.
- Benchmarks that rely on model-based judging introduce circularity that deterministic trace matching avoids.
- The released dataset and harness allow repeated, consistent testing of future models on the same tasks.
- Progress on these tasks would require agents to handle synthesis and judgment rather than only retrieval and calculation.
Where Pith is reading between the lines
- Success on this benchmark could serve as a template for creating similar traceable-task sets in legal or medical domains.
- The performance gap suggests that simply increasing model scale may not close the difference without explicit mechanisms for step verification.
- Agents trained or prompted to output intermediate traces matching the expert format might show measurable gains on the same tasks.
Load-bearing premise
The 102 tasks drawn from hedge fund work and their accompanying expert traces are representative of open-ended analyst questions and support fully deterministic grading.
What would settle it
An independent replication in which the same agents are run through the published evaluation harness and achieve scores above 30 percent would contradict the reported performance levels.
Original abstract
AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert Analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hedge-Bench 1.0, a benchmark of 102 tasks drawn from actual professional hedge-fund analyst workflows, each paired with verified expert reasoning traces to enable deterministic grading. It argues that existing benchmarks suffer from noise and circularity due to model-based judging of open-ended outputs, and reports that frontier models and agents score below 16% on this new benchmark. The dataset and evaluation harness are released publicly.
Significance. If the construction and grading procedure hold, the benchmark provides a concrete, falsifiable measure of the gap between current AI agents and expert-level financial reasoning on realistic, open-ended tasks. The grounding in professional traces and deterministic evaluation against expert steps is a methodological strength that could influence how future financial-reasoning benchmarks are designed.
major comments (3)
- [§3] §3 (Benchmark Construction): The claim that the 102 tasks are representative of expert Analyst work rests on selection from professional hedge-fund traces, but the manuscript provides no explicit inclusion/exclusion criteria, sampling procedure, or quantification of task diversity (e.g., by topic, time horizon, or information-source type). This directly affects the generalizability of the <16% result.
- [§4] §4 (Evaluation Procedure): The deterministic grading approach is presented as eliminating noise and circularity, yet no inter-rater reliability statistics for the expert traces, no rubric examples, and no analysis of grading edge cases or disagreement rates are reported. Without these, the central claim that scores are reliably below 16% cannot be fully evaluated.
- [Results] Results section: The headline performance numbers (<16%) are given without statistical error bars, confidence intervals, or breakdown by model/agent type and task category. This makes it impossible to assess whether the reported ceiling is robust or sensitive to small changes in the 102-task set.
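To make the third comment concrete, here is a minimal sketch of how wide a confidence interval can be with only 102 tasks. It assumes each task is scored pass/fail, which the source does not confirm, and it uses the Wilson score interval as one reasonable choice among several. The count of 16 passed tasks is a hypothetical placeholder, not a reported result.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical count: 16 of 102 tasks passed (about 15.7 percent).
low, high = wilson_interval(16, 102)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 10 to 24 percent, which illustrates why the referee asks for error bars. If tasks instead receive partial credit per step, a bootstrap over tasks would be a more natural fit than a binomial interval.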
minor comments (2)
- [Abstract] The abstract and introduction repeatedly use “deterministic grading” without a concise definition or pointer to the exact grading algorithm in the evaluation harness.
- [Table 1] Table 1 (model scores) would benefit from an additional column showing the number of tasks each model/agent was evaluated on, to clarify coverage.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, indicating revisions where the manuscript can be strengthened by adding requested details.
Point-by-point responses
- Referee: [§3] §3 (Benchmark Construction): The claim that the 102 tasks are representative of expert Analyst work rests on selection from professional hedge-fund traces, but the manuscript provides no explicit inclusion/exclusion criteria, sampling procedure, or quantification of task diversity (e.g., by topic, time horizon, or information-source type). This directly affects the generalizability of the <16% result.
  Authors: We agree that explicit documentation of the selection process is needed for assessing generalizability. The 102 tasks were drawn from a larger collection of verified professional hedge-fund analyst workflows, with selection focused on open-ended tasks requiring integration of multiple heterogeneous sources. In the revision we will add a subsection to §3 that states the inclusion/exclusion criteria, describes the sampling procedure from the trace pool, and reports quantitative diversity statistics by topic, time horizon, and information-source type.
  Revision: yes
- Referee: [§4] §4 (Evaluation Procedure): The deterministic grading approach is presented as eliminating noise and circularity, yet no inter-rater reliability statistics for the expert traces, no rubric examples, and no analysis of grading edge cases or disagreement rates are reported. Without these, the central claim that scores are reliably below 16% cannot be fully evaluated.
  Authors: The grading itself is deterministic because it verifies completion of specific steps listed in the expert traces rather than relying on open-ended judgment. Nevertheless, we accept that reliability information on the traces themselves would strengthen the claim. In the revised §4 we will report inter-rater reliability statistics for the expert traces, provide sample rubrics, and include an analysis of grading edge cases and observed disagreement rates.
  Revision: yes
- Referee: Results section: The headline performance numbers (<16%) are given without statistical error bars, confidence intervals, or breakdown by model/agent type and task category. This makes it impossible to assess whether the reported ceiling is robust or sensitive to small changes in the 102-task set.
  Authors: We will expand the Results section to include statistical error bars and confidence intervals around the aggregate scores. We will also add performance breakdowns by model/agent type and by task category so readers can evaluate robustness and sensitivity to the task set.
  Revision: yes
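The rebuttal promises inter-rater reliability statistics for the expert traces but does not name a statistic. One common choice for two raters is Cohen's kappa, sketched below as an assumption rather than as the authors' method. The function name and the label lists are hypothetical; a label here could be, for example, whether a rater marks a given trace step as required (1) or not (0).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four trace steps from two expert raters.
kappa_same = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0])    # identical labels
kappa_chance = cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])  # agreement at chance level
```

Kappa equals 1 when raters agree on every item and 0 when agreement is no better than chance. With more than two raters, a statistic such as Fleiss' kappa or Krippendorff's alpha would be needed instead.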
Circularity Check
No significant circularity; empirical benchmark construction is self-contained
The paper introduces Hedge-Bench as an empirical dataset of 102 tasks drawn from professional hedge-fund traces, with explicit expert reasoning for deterministic grading. No equations, derivations, fitted parameters, or predictions appear in the manuscript. The central claim (frontier models score <16%) is a direct empirical measurement against the constructed benchmark and does not reduce to any self-referential step, self-citation chain, or ansatz. The construction is externally verifiable via the published dataset and does not rely on internal normalization or uniqueness theorems. This is the normal case of a benchmark paper whose result is independent of its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 102 tasks represent the harder, more valuable challenge of open-ended financial reasoning that defines expert Analyst work.