FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Aaryamonvikram Singh; Chen Xu; Daniil Orel; Debopriyo Banerjee; Dhruv Sahnan; Fajri Koto; Fan Zhang; Georgi Georgiev; Hachem Madmoun; Haonan Li

arXiv: 2506.02515 · v4 · submitted 2025-06-03 · cs.CL · cs.AI · cs.LG

Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov

Pith reviewed 2026-05-19 11:15 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: financial reasoning · chain-of-thought · benchmark · large language models · symbolic reasoning · verifiable evaluation · fine-tuned models

The pith

FinChain benchmark shows frontier LLMs have clear limits in symbolic financial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FinChain as the first benchmark built specifically for verifiable chain-of-thought evaluation in finance. It covers 58 topics across 12 domains through parameterized symbolic templates backed by executable Python code, so every reasoning step can be machine-checked and new examples can be generated at scale without data contamination. The authors also define CHAINEVAL, a metric that scores both the final numerical answer and the consistency of the intermediate reasoning chain. When 26 leading models are tested, frontier systems show persistent failures on these multi-step tasks, yet models fine-tuned on financial or mathematical data close much of the gap. This matters because financial decisions require transparent, auditable reasoning rather than plausible final numbers alone.
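To make the idea of a parameterized symbolic template concrete, here is a minimal sketch in Python. It is not taken from the FinChain repository: the function name `compound_interest_template`, the `Step` class, the parameter ranges, and the question wording are all assumptions made for illustration. The point it shows is that sampling parameters from a seed and computing each step by executing code yields a question, a checkable step trace, and a final answer together.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    description: str  # human-readable statement of the step
    value: float      # numeric result a checker can compare against


def compound_interest_template(seed: int) -> dict:
    """Hypothetical template: sample parameters, then derive every step by running code."""
    rng = random.Random(seed)  # a fixed seed makes the instance reproducible
    principal = rng.randrange(1_000, 50_001, 500)
    rate_pct = rng.choice([2.0, 3.5, 5.0, 7.5])
    years = rng.randint(2, 10)

    growth = (1 + rate_pct / 100) ** years
    final_amount = principal * growth
    interest = final_amount - principal

    question = (
        f"An investor deposits ${principal:,} at {rate_pct}% annual interest, "
        f"compounded yearly, for {years} years. How much interest is earned?"
    )
    steps = [
        Step("Growth factor (1 + r)^n", round(growth, 6)),
        Step("Final amount = P * growth factor", round(final_amount, 2)),
        Step("Interest earned = final amount - P", round(interest, 2)),
    ]
    return {"question": question, "steps": steps, "answer": round(interest, 2)}


instance = compound_interest_template(seed=7)
print(instance["question"])
for step in instance["steps"]:
    print(f"  {step.description}: {step.value}")
```

Because the trace is computed rather than written by hand, a fresh seed gives a new instance whose intermediate values are correct by construction, which is the property the contamination-free claim depends on.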

Core claim

FinChain spans 58 topics across 12 financial domains using parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. The CHAINEVAL dynamic alignment measure jointly evaluates final-answer correctness and step-level reasoning consistency. Evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap.

What carries the argument

The argument rests on the FinChain benchmark of parameterized symbolic templates with executable Python code, paired with the CHAINEVAL dynamic alignment measure, which scores both final answers and intermediate reasoning consistency.
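The sketch below illustrates what a joint answer-plus-steps score can look like. It is an illustrative stand-in, not the paper's CHAINEVAL definition, which this page does not reproduce. The assumptions are: steps are compared by numeric value only, alignment is an in-order longest-common-subsequence match with a relative tolerance, and the joint score is zeroed when the final answer is wrong. All function names are hypothetical.

```python
import math


def values_match(a: float, b: float, rel_tol: float = 1e-2) -> bool:
    """Treat two numbers as the same step result if they agree within a tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9)


def aligned_steps(pred: list[float], gold: list[float]) -> int:
    """Size of the largest in-order matching between predicted and gold step values."""
    rows, cols = len(pred) + 1, len(gold) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if values_match(pred[i - 1], gold[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def chain_score(pred_steps, gold_steps, pred_answer, gold_answer) -> dict:
    """Assumed scoring rule: step-level F1, gated by final-answer correctness."""
    matched = aligned_steps(pred_steps, gold_steps)
    precision = matched / len(pred_steps) if pred_steps else 0.0
    recall = matched / len(gold_steps) if gold_steps else 0.0
    total = precision + recall
    step_f1 = 2 * precision * recall / total if total else 0.0
    answer_ok = values_match(pred_answer, gold_answer)
    return {
        "step_f1": step_f1,
        "answer_correct": answer_ok,
        "joint": step_f1 if answer_ok else 0.0,
    }


gold_steps = [1.157625, 11576.25, 1576.25]
good = chain_score([1.157625, 11576.25, 1576.25], gold_steps, 1576.25, 1576.25)
flawed = chain_score([1.157625, 11576.25, 1500.0], gold_steps, 1500.0, 1576.25)
print(good)
print(flawed)
```

In the second call the model gets two of three steps right but the last step and the answer wrong, so the step score stays partial while the joint score drops to zero. An answer-only metric would not distinguish this from a chain that was wrong throughout.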

If this is right

FinChain supplies a scalable testbed for measuring and improving verifiable multi-step financial reasoning in AI systems.
Domain-specific and math-enhanced fine-tuning emerges as an effective route to reduce gaps in symbolic financial tasks.
The benchmark can guide development of financial AI that supports transparency and external verification of each reasoning step.
Persistent weaknesses identified by FinChain highlight the need for new training methods focused on chain-level consistency rather than answer matching alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar template-and-verification designs could be adapted to create benchmarks for other high-stakes domains that demand auditable reasoning, such as regulatory compliance or tax planning.
Models that improve on FinChain may transfer better to real-time financial tools where users must trace and validate every calculation.
The approach opens the possibility of automatically generating training data for financial reasoning that remains free of leakage from public internet sources.

Load-bearing premise

The parameterized symbolic templates and CHAINEVAL metric accurately capture the intermediate reasoning steps required for real-world financial analysis and transparency.

What would settle it

A model that achieves high CHAINEVAL scores on FinChain yet produces inconsistent or incorrect reasoning when applied to real financial statements or regulatory filings that require similar multi-step calculations.

Original abstract

Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FinChain builds a verifiable financial CoT benchmark with symbolic templates and Python execution plus a joint answer-plus-consistency metric, but the fixed structures may miss alternative valid reasoning paths.
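The concern about alternative valid reasoning paths can be illustrated with a small, hypothetical example. Both functions below compute the same compound interest figure, one through a closed-form growth factor and one by accumulating the balance year by year. The functions and the value-overlap count are assumptions for illustration and are not part of FinChain or CHAINEVAL.

```python
import math


def closed_form_path(principal: float, rate: float, years: int) -> list[float]:
    """Template-style trace: growth factor, final amount, interest."""
    growth = (1 + rate) ** years
    final_amount = principal * growth
    return [growth, final_amount, final_amount - principal]


def year_by_year_path(principal: float, rate: float, years: int) -> list[float]:
    """Equally valid trace: one balance per year, then the interest."""
    balance = principal
    trace = []
    for _ in range(years):
        balance = balance * (1 + rate)
        trace.append(balance)
    trace.append(balance - principal)
    return trace


def shared_values(path_a: list[float], path_b: list[float], rel_tol: float = 1e-6) -> int:
    """Count values in path_a that also appear somewhere in path_b."""
    return sum(
        any(math.isclose(a, b, rel_tol=rel_tol) for b in path_b) for a in path_a
    )


closed = closed_form_path(10_000, 0.05, 5)
yearly = year_by_year_path(10_000, 0.05, 5)
print("same answer:", math.isclose(closed[-1], yearly[-1], rel_tol=1e-9))
print("template steps found in the alternative path:", shared_values(closed, yearly), "of", len(closed))
print("alternative-path steps absent from the template:", len(yearly) - shared_values(yearly, closed))
```

The two traces agree on the answer but not on the intermediate values, so any metric that aligns a model's steps against a single fixed trace needs some way to avoid penalizing the second derivation.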

FinChain is a benchmark that generates financial reasoning examples from parameterized symbolic templates backed by executable Python code, then scores both final answers and step consistency with a new measure called CHAINEVAL. The evaluations on 26 LLMs show frontier models still struggle while some domain or math fine-tuned ones close much of the gap. That is the core takeaway worth knowing right away.

Referee Report

2 major / 2 minor

Summary. The paper introduces FinChain, the first benchmark for verifiable chain-of-thought financial reasoning. It spans 58 topics across 12 domains using parameterized symbolic templates paired with executable Python code for machine-verifiable steps and scalable, contamination-free generation. The authors propose CHAINEVAL, a dynamic alignment metric that scores both final-answer correctness and step-level reasoning consistency. Evaluation of 26 LLMs shows frontier models exhibit limitations in symbolic financial reasoning while domain-adapted and math-enhanced fine-tuned models substantially narrow the gap.

Significance. If the templates and CHAINEVAL metric prove robust, the work supplies a much-needed resource for assessing intermediate reasoning transparency in financial AI, directly addressing the final-answer focus of datasets such as FinQA. The machine-verifiable Python execution and open GitHub release constitute clear strengths for reproducibility and future model development.

major comments (2)

[§3] §3 (Benchmark Construction): The central claim that FinChain reveals genuine limitations in frontier LLMs rests on the assumption that the fixed symbolic templates and CHAINEVAL alignment capture the essential intermediate steps of real financial reasoning. The manuscript provides no validation study comparing template-derived traces against expert-annotated alternative valid paths or conditional branches, which is load-bearing for interpreting the reported performance gaps.
[§5] §5 (Experiments and Results): The statement that fine-tuned models 'substantially narrow this gap' requires explicit reporting of whether the fine-tuning corpora overlap with the FinChain template parameters or generated instances; without this, the improvement could partly reflect memorization rather than improved symbolic reasoning.
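The overlap report requested in the second major comment could start from a surface-level check like the sketch below. It is an assumed procedure, not one described in the paper: it flags benchmark items that share any word 8-gram with a fine-tuning corpus, with an option to mask numbers so that the same template instantiated with different parameters is still caught. Function names, the n-gram length, and the toy strings are hypothetical.

```python
import re


def word_ngrams(text: str, n: int = 8, mask_numbers: bool = False) -> set[tuple[str, ...]]:
    """Lowercased word n-grams; optionally replace every number with a placeholder."""
    if mask_numbers:
        text = re.sub(r"\d+(?:[.,]\d+)*", "<num>", text)
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(benchmark_items, corpus_docs, n: int = 8, mask_numbers: bool = False) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= word_ngrams(doc, n, mask_numbers)
    flagged = sum(
        1 for item in benchmark_items if word_ngrams(item, n, mask_numbers) & corpus_grams
    )
    return flagged / len(benchmark_items) if benchmark_items else 0.0


# Toy data, invented for this example only.
items = [
    "A deposit of 1000 dollars earns 5 percent interest each year for 3 years",
    "Compute the net present value of the following cash flows at a given discount rate",
]
corpus = [
    "The course notes say a deposit of 2000 dollars earns 7 percent interest each year for 4 years in total",
]

exact = overlap_rate(items, corpus)
masked = overlap_rate(items, corpus, mask_numbers=True)
print("exact overlap:", exact)
print("number-masked overlap:", masked)
```

The gap between the two rates is the relevant signal here: zero exact overlap does not rule out exposure to the same template wording with different parameter values.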

minor comments (2)

[Abstract and §2] The abstract and §2 should state the total number of generated instances and the distribution across the 58 topics to allow readers to assess coverage and statistical power.
[Results figures] Figure captions and axis labels in the results section would benefit from explicit mention of the CHAINEVAL components (final-answer vs. step-consistency) for immediate interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major point below and outline planned revisions.

Referee: [§3] §3 (Benchmark Construction): The central claim that FinChain reveals genuine limitations in frontier LLMs rests on the assumption that the fixed symbolic templates and CHAINEVAL alignment capture the essential intermediate steps of real financial reasoning. The manuscript provides no validation study comparing template-derived traces against expert-annotated alternative valid paths or conditional branches, which is load-bearing for interpreting the reported performance gaps.

Authors: We thank the referee for this observation. The 58 symbolic templates were authored by financial domain experts to represent canonical, executable reasoning sequences drawn from standard practices across the 12 domains. CHAINEVAL was developed precisely to score step-level consistency against these machine-verifiable traces. We acknowledge that the current manuscript does not contain a formal expert validation study comparing template traces to alternative valid paths or conditional branches. In the revised version we will add a subsection in §3 that (i) details the expert-driven template construction process, (ii) provides concrete examples of plausible alternative reasoning paths, and (iii) includes an explicit limitations paragraph on the scope of the template-based approach, thereby clarifying the interpretation of the performance gaps.
Revision: yes

Referee: [§5] §5 (Experiments and Results): The statement that fine-tuned models 'substantially narrow this gap' requires explicit reporting of whether the fine-tuning corpora overlap with the FinChain template parameters or generated instances; without this, the improvement could partly reflect memorization rather than improved symbolic reasoning.

Authors: We agree that explicit disclosure is required. The domain-adapted and math-enhanced models we evaluate were fine-tuned on publicly released datasets (e.g., financial QA corpora and mathematical reasoning collections) that predate the creation of FinChain. Because FinChain’s parameterized templates and generated instances are novel and not contained in any prior public training data, no overlap exists. In the revised §5 we will insert a dedicated paragraph that (i) lists the exact fine-tuning corpora, (ii) states the temporal and content-based absence of overlap with FinChain, and (iii) notes that all test instances were generated after model release dates, thereby ruling out memorization as an explanation for the observed gains.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper introduces FinChain as a new benchmark using parameterized symbolic templates with executable Python for verifiable CoT in financial reasoning, along with the CHAINEVAL metric for joint final-answer and step-level evaluation. Central claims about LLM limitations are empirical results from testing 26 existing models on this benchmark; no parameters are fitted inside the paper to generate 'predictions' that reduce to those same inputs, no self-citation chains justify uniqueness or load-bearing premises, and no equations or derivations equate outputs to definitions by construction. The work is self-contained as an external benchmark against which model performance is measured.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the chosen 58 topics and 12 domains adequately represent the space of multi-step financial reasoning and that executable Python templates can faithfully encode human-like intermediate steps without introducing artifacts.

axioms (1)

Domain assumption: Symbolic templates with Python code can represent the intermediate reasoning steps required for transparent and verifiable financial analysis.
Invoked in the description of FinChain construction to enable machine verification.

pith-pipeline@v0.9.0 · 5843 in / 1142 out tokens · 35400 ms · 2026-05-19T11:15:37.096438+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code... CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We propose ChainEval, an evaluation framework that assesses model outputs along two axes: final answer correctness and reasoning step alignment.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
cs.CL · 2026-04 · conditional · novelty 8.0

SAHM is the first Arabic financial benchmark with seven tasks including AAOIFI standards QA, fatwa reasoning, accounting exams, sentiment analysis, summarization, and event-cause reasoning, showing that Arabic fluency...

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
cs.LG · 2026-05 · conditional · novelty 6.0

LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on dif...