FinBen: A Holistic Financial Benchmark for Large Language Models

Alejandro Lopez-Lira; Benyou Wang; Chenhan Yuan; Dong Li; Duanyu Feng; Gang Hu; Guojun Xiong; Haohang Li; Haoqiang Kang; Hao Wang

arXiv: 2402.12659 · v2 · pith:2P3B2F6Inew · submitted 2024-02-20 · cs.CL · cs.AI · cs.CE

Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang

classification cs.CL cs.AI cs.CE

keywords LLMs, evaluation, financial, FinBen, tasks, datasets, generation, text

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and codes are released for the research community: https://github.com/The-FinAI/PIXIU.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Meta-Benchmarks for Financial-Services LLM Evaluation
cs.AI 2026-07 unverdicted novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for compa...
LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank
cs.CL 2026-06 unverdicted novelty 7.0

LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.
CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning
cs.CL 2026-06 unverdicted novelty 6.0

CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overest...
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

FINESSE-Bench is a hierarchical benchmark suite of eight datasets with 3,993 questions for evaluating LLMs on financial domain knowledge, technical analysis, and professional competencies.
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

FINESSE-Bench is a new hierarchical benchmark suite combining certification-style exams, trading tasks, and a Russian olympiad set to evaluate LLMs on financial competencies at multiple difficulty levels.
Agentic Retrieval-Augmented Generation for Financial Document Question Answering
cs.AI 2026-05 unverdicted novelty 6.0

FinAgent-RAG achieves 76.81-78.46% execution accuracy on financial QA benchmarks by combining contrastive retrieval, program-of-thought code generation, and adaptive strategy routing, outperforming baselines by 5.62-9...
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
cs.SE 2026-04 unverdicted novelty 6.0

SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence
cs.CE 2026-05 accept novelty 5.0

Reported alpha from end-to-end LLM trading agents does not constitute deployment evidence until it passes structural tests for temporal integrity, frictions, robustness, calibration, execution, and disaggregation.
Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems
cs.AI 2026-06 unverdicted novelty 3.0

Reproducibility audit of 30 LLM trading papers shows execution assumptions under-reported relative to agent architectures, illustrated by a 10-equity example where frictions compress returns.
Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation
q-fin.CP 2025-02 unverdicted novelty 3.0

CausalGAN + SAC RL pipeline generates synthetic bond yield data; fine-tuned Qwen2.5-7B LLM produces trading signals, with reported MAE 0.103, 60% profit rate, and LLM score 3.37/5.