FinanceBench: A New Benchmark for Financial Question Answering
Pith reviewed 2026-05-16 04:55 UTC · model grok-4.3
The pith
Even with retrieval support, GPT-4-Turbo incorrectly answers or refuses to answer 81 percent of financial questions, and all tested LLMs show clear limitations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FinanceBench is a test suite of 10,231 questions about publicly traded companies, each paired with an answer and an evidence string. When 16 state-of-the-art model configurations are tested on a sample of 150 cases, GPT-4-Turbo used with a retrieval system incorrectly answers or refuses to answer 81 percent of questions. Augmentation techniques such as longer context windows improve results modestly but prove unrealistic for enterprise use because of increased latency and an inability to process larger financial documents, and all models exhibit hallucinations and other weaknesses that limit their practical suitability.
What carries the argument
FinanceBench, a dataset of 10,231 questions on publicly traded companies with paired answers and evidence strings that sets a minimum standard for open-book financial QA.
If this is right
- Retrieval augmentation alone proves insufficient for accurate financial QA across tested models.
- Longer context windows raise performance but increase latency and fail to scale to larger financial documents.
- All examined models display hallucinations and refusals that restrict enterprise deployment.
- The benchmark establishes a baseline that current LLMs do not meet for straightforward financial questions.
Where Pith is reading between the lines
- Enterprises using LLMs for financial analysis would need supplementary human review or verification layers to catch errors.
- Development of domain-specific models trained on financial statements could address the observed gaps in numerical and evidence-based reasoning.
- Similar benchmark construction in other specialized domains might expose parallel reliability shortfalls in general-purpose LLMs.
- Releasing the full 10,231-question set could enable targeted fine-tuning or new retrieval methods for financial data.
Load-bearing premise
That the 150 sampled cases accurately represent the full set of 10,231 questions, and that all questions are ecologically valid and clear-cut as stated.
What would settle it
A model configuration that correctly answers at least 90 percent of the 150 cases without refusals or hallucinations would challenge the reported limitations.
read the original abstract
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer to serve as a minimum performance standard. We test 16 state of the art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FinanceBench, a benchmark of 10,231 ecologically valid financial QA questions on publicly traded companies (with answers and evidence strings). It evaluates 16 LLM configurations (GPT-4-Turbo, Llama2, Claude2, with retrieval and long-context variants) on a 150-question sample via manual review of 2,400 answers, reports that GPT-4-Turbo + retrieval fails or refuses on 81% of cases, and concludes that current models exhibit hallucinations and other weaknesses that limit enterprise use.
Significance. If the empirical failure rates hold, FinanceBench supplies a needed domain-specific benchmark for financial QA and quantifies concrete gaps in numerical reasoning and evidence handling. The open release of the 150-case sample and the transparent evaluation protocol are strengths that enable follow-on work.
major comments (2)
- [Evaluation section (manual review protocol)] The headline 81% failure rate for GPT-4-Turbo + retrieval rests entirely on manual correctness judgments of 2,400 answers. The manuscript provides no inter-annotator agreement statistic, no explicit annotation rubric, and no description of how borderline cases (numerical precision, unit handling, scope of refusal) were resolved. This directly affects whether the reported error rate reflects model behavior or annotation variance.
- [Experimental setup (sampling)] The 150-case sample is drawn from the full 10,231-question set, yet no justification or statistical check is supplied that the sample preserves the distribution of question types, document lengths, or difficulty. Claims about “all models examined” therefore rest on an unverified assumption of representativeness.
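The sampling objection above is mechanically checkable: a chi-square goodness-of-fit test comparing the 150-case sample's category counts against the full-set proportions. A minimal stdlib-only sketch, using a hypothetical split across FinanceBench's three question types (the paper's true per-type counts are not given here):

```python
from collections import Counter

def chisq_gof(sample_labels, full_labels):
    """Chi-square goodness-of-fit statistic: do the sample's category
    counts match the proportions observed in the full question set?"""
    n = len(sample_labels)
    obs = Counter(sample_labels)
    full = Counter(full_labels)
    total = len(full_labels)
    stat = 0.0
    for cat, count in full.items():
        expected = n * count / total
        stat += (obs.get(cat, 0) - expected) ** 2 / expected
    return stat

# Hypothetical split of the 10,231 questions across the three question
# types (illustrative numbers, not the paper's actual distribution).
full = (["metrics-generated"] * 6000 + ["domain-relevant"] * 2500
        + ["novel-generated"] * 1731)
sample = (["metrics-generated"] * 88 + ["domain-relevant"] * 37
          + ["novel-generated"] * 25)
stat = chisq_gof(sample, full)
# With df = 2, the 5% critical value is 5.991; a statistic below it is
# consistent with the 150-case sample preserving the full-set mix.
```

The same check extends to document lengths or difficulty proxies by binning the continuous variable into categories first.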
minor comments (2)
- [Abstract and §3] The abstract states the questions are “intended to be clear-cut and straightforward”; the manuscript should include a short appendix or table showing the distribution of question categories (e.g., numerical extraction vs. qualitative) to let readers assess this claim.
- [Results section] The table or figure reporting per-model results should include confidence intervals or exact counts alongside the 81% figure so readers can gauge the precision of the headline statistic.
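The interval the minor comment asks for is cheap to produce: with n = 150 graded cases and a point estimate of 81%, a Wilson score interval takes a few lines. A stdlib-only sketch (the 0.81 proportion is taken from the reported headline figure):

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    p_hat: observed proportion (here, the reported 81% failure rate);
    n: number of graded cases (150 in the evaluation sample).
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(0.81, 150)
# roughly (0.74, 0.86): even the low end is a 74% failure rate
```

With n = 150 the 95% interval spans roughly 74% to 86%, wide enough that fine-grained rank orderings between models near the headline number should be read cautiously.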
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We will revise the manuscript to address the concerns regarding the evaluation protocol and sampling methodology, thereby improving the transparency and robustness of our results.
read point-by-point responses
-
Referee: The headline 81% failure rate for GPT-4-Turbo + retrieval rests entirely on manual correctness judgments of 2,400 answers. The manuscript provides no inter-annotator agreement statistic, no explicit annotation rubric, and no description of how borderline cases (numerical precision, unit handling, scope of refusal) were resolved. This directly affects whether the reported error rate reflects model behavior or annotation variance.
Authors: We agree that the manuscript would benefit from a more detailed description of the manual evaluation process. The reviews were conducted by a single expert annotator with extensive experience in financial analysis. We will add an explicit annotation rubric to the revised paper, detailing criteria for numerical precision (allowing for standard financial rounding), unit handling, and when a refusal is considered a failure. Borderline cases were resolved conservatively by requiring the answer to be fully supported by the evidence string. Since only one annotator was involved, inter-annotator agreement is not applicable; we will explicitly state this in the revision and discuss potential limitations. This change will be incorporated in the Evaluation section. revision: yes
-
Referee: The 150-case sample is drawn from the full 10,231-question set, yet no justification or statistical check is supplied that the sample preserves the distribution of question types, document lengths, or difficulty. Claims about “all models examined” therefore rest on an unverified assumption of representativeness.
Authors: We acknowledge the need to demonstrate that the 150-case sample is representative. The sample was chosen randomly from the full benchmark with the intent to cover a diverse range of question types and complexities. In the revised manuscript, we will include a justification along with statistical comparisons (such as distribution of question categories and document lengths) between the sample and the full set to validate representativeness. This will be added to the Experimental setup section. revision: yes
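The annotation rubric described in the rebuttal (numerical precision "allowing for standard financial rounding") can be made concrete as a grading rule. A minimal sketch, assuming a 0.5% relative tolerance; the tolerance value is hypothetical, since the paper does not specify one:

```python
def matches_gold(answer: float, gold: float, rel_tol: float = 0.005) -> bool:
    """Hypothetical numeric grading rule: accept an answer within a
    relative tolerance of the gold value, approximating 'standard
    financial rounding'. The 0.5% default is an assumption, not the
    paper's actual threshold."""
    if gold == 0:
        return answer == 0
    return abs(answer - gold) / abs(gold) <= rel_tol

# e.g. gold revenue 386.0 (billions), model answers 386.1 -> accepted;
# a model answer of 390.0 would be rejected under this tolerance.
```

Publishing such a rule alongside the rubric would let readers reproduce borderline-case decisions rather than trust the single annotator's judgment.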
Circularity Check
No circularity: empirical benchmark evaluation with direct measurements
full rationale
The paper introduces FinanceBench as a new dataset of 10,231 questions with provided answers and evidence, then reports direct empirical results from testing 16 model configurations on a 150-question sample and manually reviewing 2,400 outputs. The headline 81% failure rate for GPT-4-Turbo + retrieval is a straightforward count of human-judged incorrect or refused answers, with no equations, fitted parameters, predictions derived from inputs, or self-citation chains that reduce the central claim to a tautology. The work is self-contained against external models and released data; no load-bearing step equates a result to its own construction by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The questions are clear-cut and straightforward to answer using the provided evidence strings
Forward citations
Cited by 20 Pith papers
-
IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
IndiaFinBench is the first public benchmark for LLMs on Indian financial regulatory text, with twelve models scoring 70.4-89.7% accuracy and all outperforming a 69% human baseline.
-
BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications
BizCompass is a dual-axis benchmark evaluating LLMs on business knowledge in finance, economics, statistics, and operations management, linked to analyst, trader, and consultant roles, with public datasets released af...
-
Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
Pico reduces LoRA merge interference by calibrating over-shared directions in the B matrix before merging, yielding 3.4-8.3 point accuracy gains and sometimes beating joint training.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
Training Transformers for KV Cache Compressibility
KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
-
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
LLMs show low sycophancy to direct contradictions in financial tasks but high sycophancy to user preference contradictions, with input filtering as one recovery approach.
-
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
-
FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
-
Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management
PBKV predicts agent invocations in dynamic LLM workflows to manage KV-cache reuse, delivering up to 1.85x speedup over LRU and 1.26x over KVFlow.
-
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
-
Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints
Under tight compute limits, structured memory raises precision on exact financial calculations while retrieval-based methods perform better on conversational queries.
-
Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents
Tree reasoning outperforms vector search on complex document queries but a hybrid approach balances results across tiers, with validation showing an 11.7-point gap on real finance documents.
-
Empirical Evaluation of PDF Parsing and Chunking for Financial Question Answering with RAG
Systematic tests show that specific PDF parsers combined with overlapping chunking strategies better preserve structure and improve RAG answer correctness on financial QA benchmarks including the new TableQuest dataset.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Architecture Matters More Than Scale: A Comparative Study of Retrieval and Memory Augmentation for Financial QA Under SME Compute Constraints
Structured memory improves precision on deterministic financial calculations while retrieval-augmented generation outperforms in conversational settings, supporting a hybrid deployment framework for resource-constrained SMEs.
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.