FinanceBench: A New Benchmark for Financial Question Answering
Pith reviewed 2026-05-16 04:55 UTC · model grok-4.3
The pith
Even with retrieval support, GPT-4-Turbo incorrectly answers or refuses to answer 81 percent of financial questions, and all tested LLMs show clear limitations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FinanceBench is a test suite of 10,231 questions about publicly traded companies, each paired with an answer and an evidence string. When 16 state-of-the-art model configurations are tested on a sample of 150 cases, GPT-4-Turbo used with a retrieval system incorrectly answers or refuses to answer 81 percent of questions. Augmentation techniques such as longer context windows improve results modestly but prove unrealistic for enterprise use because of increased latency and an inability to process larger financial documents, and all models exhibit hallucinations and other weaknesses that limit their practical suitability.
What carries the argument
FinanceBench, a dataset of 10,231 questions on publicly traded companies with paired answers and evidence strings that sets a minimum standard for open-book financial QA.
If this is right
- Retrieval augmentation alone proves insufficient for accurate financial QA across tested models.
- Longer context windows raise performance but increase latency and fail to scale to larger financial documents.
- All examined models display hallucinations and refusals that restrict enterprise deployment.
- The benchmark establishes a baseline that current LLMs do not meet for straightforward financial questions.
Where Pith is reading between the lines
- Enterprises using LLMs for financial analysis would need supplementary human review or verification layers to catch errors.
- Development of domain-specific models trained on financial statements could address the observed gaps in numerical and evidence-based reasoning.
- Similar benchmark construction in other specialized domains might expose parallel reliability shortfalls in general-purpose LLMs.
- Releasing the full 10,231-question set could enable targeted fine-tuning or new retrieval methods for financial data.
Load-bearing premise
That the 150 sampled cases accurately represent the full set of 10,231 questions, and that all questions are ecologically valid and clear-cut as stated.
What would settle it
A model configuration that correctly answers at least 90 percent of the 150 cases without refusals or hallucinations would challenge the reported limitations.
read the original abstract
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer to serve as a minimum performance standard. We test 16 state of the art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FinanceBench, a benchmark of 10,231 ecologically valid financial QA questions on publicly traded companies (with answers and evidence strings). It evaluates 16 LLM configurations (GPT-4-Turbo, Llama2, Claude2, with retrieval and long-context variants) on a 150-question sample via manual review of 2,400 answers, reports that GPT-4-Turbo + retrieval fails or refuses on 81% of cases, and concludes that current models exhibit hallucinations and other weaknesses that limit enterprise use.
Significance. If the empirical failure rates hold, FinanceBench supplies a needed domain-specific benchmark for financial QA and quantifies concrete gaps in numerical reasoning and evidence handling. The open release of the 150-case sample and the transparent evaluation protocol are strengths that enable follow-on work.
major comments (2)
- [Evaluation section (manual review protocol)] The headline 81% failure rate for GPT-4-Turbo + retrieval rests entirely on manual correctness judgments of 2,400 answers. The manuscript provides no inter-annotator agreement statistic, no explicit annotation rubric, and no description of how borderline cases (numerical precision, unit handling, scope of refusal) were resolved. This directly affects whether the reported error rate reflects model behavior or annotation variance.
- [Experimental setup (sampling)] The 150-case sample is drawn from the full 10,231-question set, yet no justification or statistical check is supplied that the sample preserves the distribution of question types, document lengths, or difficulty. Claims about “all models examined” therefore rest on an unverified assumption of representativeness.
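The sampling objection above is mechanically checkable: a chi-square goodness-of-fit test comparing the 150-case sample's category counts against the full-set proportions. A minimal stdlib-only sketch, using a hypothetical split across FinanceBench's three question types (the paper's true per-type counts are not given here):

```python
from collections import Counter

def chisq_gof(sample_labels, full_labels):
    """Chi-square goodness-of-fit statistic: do the sample's category
    counts match the proportions observed in the full question set?"""
    n = len(sample_labels)
    obs = Counter(sample_labels)
    full = Counter(full_labels)
    total = len(full_labels)
    stat = 0.0
    for cat, count in full.items():
        expected = n * count / total
        stat += (obs.get(cat, 0) - expected) ** 2 / expected
    return stat

# Hypothetical split of the 10,231 questions across the three question
# types (illustrative numbers, not the paper's actual distribution).
full = (["metrics-generated"] * 6000 + ["domain-relevant"] * 2500
        + ["novel-generated"] * 1731)
sample = (["metrics-generated"] * 88 + ["domain-relevant"] * 37
          + ["novel-generated"] * 25)
stat = chisq_gof(sample, full)
# With df = 2, the 5% critical value is 5.991; a statistic below it is
# consistent with the 150-case sample preserving the full-set mix.
```

The same check extends to document lengths or difficulty proxies by binning the continuous variable into categories first.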
minor comments (2)
- [Abstract and §3] The abstract states the questions are “intended to be clear-cut and straightforward”; the manuscript should include a short appendix or table showing the distribution of question categories (e.g., numerical extraction vs. qualitative) to let readers assess this claim.
- [Results section] The table or figure reporting per-model results should include confidence intervals or exact counts alongside the 81% figure so readers can gauge the precision of the headline statistic.
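The interval the minor comment asks for is cheap to produce: with n = 150 graded cases and a point estimate of 81%, a Wilson score interval takes a few lines. A stdlib-only sketch (the 0.81 proportion is taken from the reported headline figure):

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    p_hat: observed proportion (here, the reported 81% failure rate);
    n: number of graded cases (150 in the evaluation sample).
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(0.81, 150)
# roughly (0.74, 0.86): even the low end is a 74% failure rate
```

With n = 150 the 95% interval spans roughly 74% to 86%, wide enough that fine-grained rank orderings between models near the headline number should be read cautiously.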
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We will revise the manuscript to address the concerns regarding the evaluation protocol and sampling methodology, thereby improving the transparency and robustness of our results.
read point-by-point responses
-
Referee: The headline 81% failure rate for GPT-4-Turbo + retrieval rests entirely on manual correctness judgments of 2,400 answers. The manuscript provides no inter-annotator agreement statistic, no explicit annotation rubric, and no description of how borderline cases (numerical precision, unit handling, scope of refusal) were resolved. This directly affects whether the reported error rate reflects model behavior or annotation variance.
Authors: We agree that the manuscript would benefit from a more detailed description of the manual evaluation process. The reviews were conducted by a single expert annotator with extensive experience in financial analysis. We will add an explicit annotation rubric to the revised paper, detailing criteria for numerical precision (allowing for standard financial rounding), unit handling, and when a refusal is considered a failure. Borderline cases were resolved conservatively by requiring the answer to be fully supported by the evidence string. Since only one annotator was involved, inter-annotator agreement is not applicable; we will explicitly state this in the revision and discuss potential limitations. This change will be incorporated in the Evaluation section. revision: yes
-
Referee: The 150-case sample is drawn from the full 10,231-question set, yet no justification or statistical check is supplied that the sample preserves the distribution of question types, document lengths, or difficulty. Claims about “all models examined” therefore rest on an unverified assumption of representativeness.
Authors: We acknowledge the need to demonstrate that the 150-case sample is representative. The sample was chosen randomly from the full benchmark with the intent to cover a diverse range of question types and complexities. In the revised manuscript, we will include a justification along with statistical comparisons (such as distribution of question categories and document lengths) between the sample and the full set to validate representativeness. This will be added to the Experimental setup section. revision: yes
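The annotation rubric described in the rebuttal (numerical precision "allowing for standard financial rounding") can be made concrete as a grading rule. A minimal sketch, assuming a 0.5% relative tolerance; the tolerance value is hypothetical, since the paper does not specify one:

```python
def matches_gold(answer: float, gold: float, rel_tol: float = 0.005) -> bool:
    """Hypothetical numeric grading rule: accept an answer within a
    relative tolerance of the gold value, approximating 'standard
    financial rounding'. The 0.5% default is an assumption, not the
    paper's actual threshold."""
    if gold == 0:
        return answer == 0
    return abs(answer - gold) / abs(gold) <= rel_tol

# e.g. gold revenue 386.0 (billions), model answers 386.1 -> accepted;
# a model answer of 390.0 would be rejected under this tolerance.
```

Publishing such a rule alongside the rubric would let readers reproduce borderline-case decisions rather than trust the single annotator's judgment.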
Circularity Check
No circularity: empirical benchmark evaluation with direct measurements
full rationale
The paper introduces FinanceBench as a new dataset of 10,231 questions with provided answers and evidence, then reports direct empirical results from testing 16 model configurations on a 150-question sample and manually reviewing 2,400 outputs. The headline 81% failure rate for GPT-4-Turbo + retrieval is a straightforward count of human-judged incorrect or refused answers, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains that reduce the central claim to a tautology. The work is self-contained against external models and released data; no load-bearing step equates a result to its own construction by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The questions are clear-cut and straightforward to answer using the provided evidence strings
Forward citations
Cited by 20 Pith papers
-
IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
IndiaFinBench is the first public benchmark for LLMs on Indian financial regulatory text, with twelve models scoring 70.4-89.7% accuracy and all outperforming a 69% human baseline.
-
BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications
BizCompass is a dual-axis benchmark evaluating LLMs on business knowledge in finance, economics, statistics, and operations management, linked to analyst, trader, and consultant roles, with public datasets released af...
-
Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
Training Transformers for KV Cache Compressibility
KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
-
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
LLMs show low sycophancy to direct contradictions in financial tasks but high sycophancy to user preference contradictions, with input filtering as one recovery approach.
-
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
-
FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
-
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.
-
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
-
Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints
Under tight compute limits, structured memory raises precision on exact financial calculations while retrieval-based methods perform better on conversational queries.
-
Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents
Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.
-
Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints
Structured memory improves precision on deterministic financial calculations while retrieval-augmented generation outperforms in conversational settings, supporting a hybrid deployment framework for resource-constrained SMEs.
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.