Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

Alice Lu; Andy Lyu; Eric Cho; Shawn Huang

arxiv: 2606.03918 · v1 · pith:J74XWQNWnew · submitted 2026-06-02 · 💻 cs.AI

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

Eric Cho , Shawn Huang , Alice Lu , Andy Lyu This is my paper

Pith reviewed 2026-06-28 09:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI agentsfinancial reasoningbenchmarkhedge fund tasksreasoning tracesdeterministic evaluationopen-ended analysis

0 comments

The pith

Frontier AI agents score below 16 percent on a benchmark of 102 real hedge fund analyst tasks graded against expert reasoning traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hedge-Bench as a set of 102 tasks taken directly from professional hedge fund work, each paired with explicit analyst reasoning steps. This design supports grading by matching outputs to verified traces instead of relying on model judgments. Current frontier models and agents reach less than 16 percent success under this standard. The benchmark targets the open-ended synthesis and judgment that separate mechanical data tasks from expert financial analysis. Accurate measurement of this gap matters for understanding where AI still falls short in high-stakes decision domains.

Core claim

Hedge-Bench 1.0 consists of 102 actual on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16 percent on the benchmark.

What carries the argument

Hedge-Bench 1.0, a benchmark of 102 tasks drawn from hedge fund analyst work and supplied with explicit expert reasoning traces that permit deterministic grading.

If this is right

Current agents cannot yet replicate the traceable reasoning steps used in professional financial analysis.
Benchmarks that rely on model-based judging introduce circularity that deterministic trace matching avoids.
The released dataset and harness allow repeated, consistent testing of future models on the same tasks.
Progress on these tasks would require agents to handle synthesis and judgment rather than only retrieval and calculation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on this benchmark could serve as a template for creating similar traceable-task sets in legal or medical domains.
The performance gap suggests that simply increasing model scale may not close the difference without explicit mechanisms for step verification.
Agents trained or prompted to output intermediate traces matching the expert format might show measurable gains on the same tasks.

Load-bearing premise

The 102 tasks drawn from hedge fund work and their accompanying expert traces are representative of open-ended analyst questions and support fully deterministic grading.

What would settle it

An independent replication in which the same agents are run through the published evaluation harness and achieve scores above 30 percent would contradict the reported performance levels.

read the original abstract

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert Analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16\% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hedge-Bench gives a new benchmark of 102 real hedge-fund tasks graded against expert traces, with frontier models below 16%, but the abstract supplies almost no information on task selection or grading reliability.

read the letter

The core of this paper is a benchmark built from 102 actual on-the-job tasks taken from professional hedge fund analysts, each paired with their explicit reasoning traces so grading can be done deterministically against those traces. That setup is the main thing worth noticing. It tries to fix the noise and circularity that come from having models judge other models on open-ended financial questions.

The approach is straightforward and the published dataset plus harness is a practical step. Showing that current frontier models and agents stay under 16% on these tasks gives a concrete signal that the problems are harder than the mechanical retrieval and calculation work that existing tests cover.

The abstract gives no information on how the tasks were picked from the larger pool of hedge fund work, whether multiple analysts reviewed the traces for consistency, or how the scores were aggregated with any measure of variance. Those omissions make it difficult to judge whether the 102 tasks are representative or whether the grading procedure introduces its own artifacts. The stress-test note says the construction looks internally consistent on the surface, and nothing in the abstract contradicts that, but the lack of those details is still the main limitation.

This is useful for groups working on agent evaluation in finance or other domains that need open-ended professional reasoning. The published resources lower the barrier to trying it out. It is worth sending to peer review so the task construction and grading protocol can be examined in full.

Referee Report

3 major / 2 minor

Summary. The paper introduces Hedge-Bench 1.0, a benchmark of 102 tasks drawn from actual professional hedge-fund analyst workflows, each paired with verified expert reasoning traces to enable deterministic grading. It argues that existing benchmarks suffer from noise and circularity due to model-based judging of open-ended outputs, and reports that frontier models and agents score below 16% on this new benchmark. The dataset and evaluation harness are released publicly.

Significance. If the construction and grading procedure hold, the benchmark provides a concrete, falsifiable measure of the gap between current AI agents and expert-level financial reasoning on realistic, open-ended tasks. The grounding in professional traces and deterministic evaluation against expert steps is a methodological strength that could influence how future financial-reasoning benchmarks are designed.

major comments (3)

[§3] §3 (Benchmark Construction): The claim that the 102 tasks are representative of expert Analyst work rests on selection from professional hedge-fund traces, but the manuscript provides no explicit inclusion/exclusion criteria, sampling procedure, or quantification of task diversity (e.g., by topic, time horizon, or information-source type). This directly affects the generalizability of the <16% result.
[§4] §4 (Evaluation Procedure): The deterministic grading approach is presented as eliminating noise and circularity, yet no inter-rater reliability statistics for the expert traces, no rubric examples, and no analysis of grading edge cases or disagreement rates are reported. Without these, the central claim that scores are reliably below 16% cannot be fully evaluated.
[Results] Results section: The headline performance numbers (<16%) are given without statistical error bars, confidence intervals, or breakdown by model/agent type and task category. This makes it impossible to assess whether the reported ceiling is robust or sensitive to small changes in the 102-task set.

minor comments (2)

[Abstract] The abstract and introduction repeatedly use “deterministic grading” without a concise definition or pointer to the exact grading algorithm in the evaluation harness.
[Table 1] Table 1 (model scores) would benefit from an additional column showing the number of tasks each model/agent was evaluated on, to clarify coverage.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating revisions where the manuscript can be strengthened by adding requested details.

read point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The claim that the 102 tasks are representative of expert Analyst work rests on selection from professional hedge-fund traces, but the manuscript provides no explicit inclusion/exclusion criteria, sampling procedure, or quantification of task diversity (e.g., by topic, time horizon, or information-source type). This directly affects the generalizability of the <16% result.

Authors: We agree that explicit documentation of the selection process is needed for assessing generalizability. The 102 tasks were drawn from a larger collection of verified professional hedge-fund analyst workflows, with selection focused on open-ended tasks requiring integration of multiple heterogeneous sources. In the revision we will add a subsection to §3 that states the inclusion/exclusion criteria, describes the sampling procedure from the trace pool, and reports quantitative diversity statistics by topic, time horizon, and information-source type. revision: yes
Referee: [§4] §4 (Evaluation Procedure): The deterministic grading approach is presented as eliminating noise and circularity, yet no inter-rater reliability statistics for the expert traces, no rubric examples, and no analysis of grading edge cases or disagreement rates are reported. Without these, the central claim that scores are reliably below 16% cannot be fully evaluated.

Authors: The grading itself is deterministic because it verifies completion of specific steps listed in the expert traces rather than relying on open-ended judgment. Nevertheless, we accept that reliability information on the traces themselves would strengthen the claim. In the revised §4 we will report inter-rater reliability statistics for the expert traces, provide sample rubrics, and include an analysis of grading edge cases and observed disagreement rates. revision: yes
Referee: Results section: The headline performance numbers (<16%) are given without statistical error bars, confidence intervals, or breakdown by model/agent type and task category. This makes it impossible to assess whether the reported ceiling is robust or sensitive to small changes in the 102-task set.

Authors: We will expand the Results section to include statistical error bars and confidence intervals around the aggregate scores. We will also add performance breakdowns by model/agent type and by task category so readers can evaluate robustness and sensitivity to the task set. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark construction is self-contained

full rationale

The paper introduces Hedge-Bench as an empirical dataset of 102 tasks drawn from professional hedge-fund traces, with explicit expert reasoning for deterministic grading. No equations, derivations, fitted parameters, or predictions appear in the manuscript. The central claim (frontier models score <16%) is a direct empirical measurement against the constructed benchmark and does not reduce to any self-referential step, self-citation chain, or ansatz. The construction is externally verifiable via the published dataset and does not rely on internal normalization or uniqueness theorems. This is the normal case of a benchmark paper whose result is independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the selected tasks and expert traces accurately capture open-ended financial reasoning without selection bias or grading ambiguity.

axioms (1)

domain assumption The 102 tasks represent the harder, more valuable challenge of open-ended financial reasoning that defines expert Analyst work.
This premise is required to establish the benchmark's relevance and is stated in the abstract.

pith-pipeline@v0.9.1-grok · 5663 in / 1154 out tokens · 23069 ms · 2026-06-28T09:59:10.145222+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 1 canonical work pages

[1]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

FinQA: A Dataset of Numerical Reasoning over Financial Data , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

2021
[2]

2022 , eprint=

ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering , author=. 2022 , eprint=

2022
[3]

TAT - QA : A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

Zhu, Fengbin and Lei, Wenqiang and Huang, Youcheng and Wang, Chao and Zhang, Shuo and Lv, Jiancheng and Feng, Fuli and Chua, Tat-Seng. TAT - QA : A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conf...

work page doi:10.18653/v1/2021.acl-long.254 2021
[4]

2023 , eprint=

FinanceBench: A New Benchmark for Financial Question Answering , author=. 2023 , eprint=

2023
[5]

2024 , isbn =

Xie, Qianqian and Han, Weiguang and Chen, Zhengyu and Xiang, Ruoyu and Zhang, Xiao and He, Yueru and Xiao, Mengxi and Li, Dong and Dai, Yongfu and Feng, Duanyu and Xu, Yijing and Kang, Haoqiang and Kuang, Ziyan and Yuan, Chenhan and Yang, Kailai and Luo, Zheheng and Zhang, Tianlin and Liu, Zhiwei and Xiong, Guojun and Deng, Zhiyang and Jiang, Yuechen and ...

2024
[6]

2023 , url=

Qianqian Xie and Weiguang Han and Xiao Zhang and Yanzhao Lai and Min Peng and Alejandro Lopez-Lira and Jimin Huang , booktitle=. 2023 , url=

2023
[7]

2025 , eprint=

DocFinQA: A Long-Context Financial Reasoning Dataset , author=. 2025 , eprint=

2025
[8]

2025 , eprint=

MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application , author=. 2025 , eprint=

2025
[9]

2025 , eprint=

Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks , author=. 2025 , eprint=

2025
[10]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026
[11]

2025 , month = oct, howpublished =

Anthropic , title =. 2025 , month = oct, howpublished =

2025
[12]

2026 , month = may, howpublished =

Anthropic , title =. 2026 , month = may, howpublished =

2026
[13]

2026 , month = apr, howpublished =

Anthropic , title =. 2026 , month = apr, howpublished =

2026
[14]

2026 , month = feb, howpublished =

Anthropic , title =. 2026 , month = feb, howpublished =

2026
[15]

2026 , month = may, howpublished =

Kavukcuoglu, Koray and Dean, Jeff and Vinyals, Oriol and Shazeer, Noam , title =. 2026 , month = may, howpublished =

2026
[16]

2026 , month = feb, howpublished =

2026
[17]

2026 , month = apr, howpublished =

Introducing. 2026 , month = apr, howpublished =

2026
[18]

2026 , month = mar, howpublished =

Introducing. 2026 , month = mar, howpublished =

2026

[1] [1]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

FinQA: A Dataset of Numerical Reasoning over Financial Data , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

2021

[2] [2]

2022 , eprint=

ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering , author=. 2022 , eprint=

2022

[3] [3]

TAT - QA : A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance

Zhu, Fengbin and Lei, Wenqiang and Huang, Youcheng and Wang, Chao and Zhang, Shuo and Lv, Jiancheng and Feng, Fuli and Chua, Tat-Seng. TAT - QA : A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conf...

work page doi:10.18653/v1/2021.acl-long.254 2021

[4] [4]

2023 , eprint=

FinanceBench: A New Benchmark for Financial Question Answering , author=. 2023 , eprint=

2023

[5] [5]

2024 , isbn =

Xie, Qianqian and Han, Weiguang and Chen, Zhengyu and Xiang, Ruoyu and Zhang, Xiao and He, Yueru and Xiao, Mengxi and Li, Dong and Dai, Yongfu and Feng, Duanyu and Xu, Yijing and Kang, Haoqiang and Kuang, Ziyan and Yuan, Chenhan and Yang, Kailai and Luo, Zheheng and Zhang, Tianlin and Liu, Zhiwei and Xiong, Guojun and Deng, Zhiyang and Jiang, Yuechen and ...

2024

[6] [6]

2023 , url=

Qianqian Xie and Weiguang Han and Xiao Zhang and Yanzhao Lai and Min Peng and Alejandro Lopez-Lira and Jimin Huang , booktitle=. 2023 , url=

2023

[7] [7]

2025 , eprint=

DocFinQA: A Long-Context Financial Reasoning Dataset , author=. 2025 , eprint=

2025

[8] [8]

2025 , eprint=

MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application , author=. 2025 , eprint=

2025

[9] [9]

2025 , eprint=

Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks , author=. 2025 , eprint=

2025

[10] [10]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026

[11] [11]

2025 , month = oct, howpublished =

Anthropic , title =. 2025 , month = oct, howpublished =

2025

[12] [12]

2026 , month = may, howpublished =

Anthropic , title =. 2026 , month = may, howpublished =

2026

[13] [13]

2026 , month = apr, howpublished =

Anthropic , title =. 2026 , month = apr, howpublished =

2026

[14] [14]

2026 , month = feb, howpublished =

Anthropic , title =. 2026 , month = feb, howpublished =

2026

[15] [15]

2026 , month = may, howpublished =

Kavukcuoglu, Koray and Dean, Jeff and Vinyals, Oriol and Shazeer, Noam , title =. 2026 , month = may, howpublished =

2026

[16] [16]

2026 , month = feb, howpublished =

2026

[17] [17]

2026 , month = apr, howpublished =

Introducing. 2026 , month = apr, howpublished =

2026

[18] [18]

2026 , month = mar, howpublished =

Introducing. 2026 , month = mar, howpublished =

2026