pith · machine review for the scientific record

arXiv:2603.19254 · v2 · submitted 2026-02-25 · 💻 cs.CL

Recognition: no theorem link

FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:38 UTC · model grok-4.3

classification: 💻 cs.CL
keywords: financial research · LLM evaluation · benchmark · hallucination correction · semantic consistency · multi-agent systems · financial analysis

The pith

FinReasoning decomposes financial research reporting into semantic consistency, data alignment, and deep insight to expose model-specific gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FinReasoning as a hierarchical benchmark that splits financial research capabilities into three layers instead of scoring outputs in a single pass. It adds a fine-grained evaluation framework with a 12-indicator rubric covering auditing, correction, and analytical depth. Testing across model types shows that closed-source models handle the full pipeline well, open-source general models consistently underperform on semantic consistency, and financial-domain models produce moderate insights but skip foundational checks. The design targets better role assignment in multi-agent financial systems and has already entered real-world pilot use.

Core claim

FinReasoning decomposes financial research reporting into semantic consistency, data alignment, and deep insight. A strengthened evaluation framework improves hallucination-correction checks and applies a 12-indicator rubric to core analytical skills. The benchmark finds closed-source models perform strongly overall, open-source general models underperform on semantic consistency, and financial-domain models generate moderate insights yet lack foundational auditing skills.

What carries the argument

The FinReasoning hierarchical benchmark that decomposes capabilities into semantic consistency, data alignment, and deep insight, evaluated through a 12-indicator rubric for analytical skills and hallucination correction.
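To make the decomposition concrete, here is a minimal sketch, in Python, of how per-indicator rubric scores could roll up into the three layer scores. The indicator names and the equal-weight averaging are illustrative assumptions; the paper's actual 12-indicator rubric is not reproduced in this review.

    from statistics import mean

    # Hypothetical grouping of 12 indicators under the three layers; the
    # names below are placeholders, not the paper's rubric.
    LAYERS = {
        "semantic_consistency": ["claim_support", "terminology_use",
                                 "hallucination_correction", "internal_coherence"],
        "data_alignment": ["numeric_accuracy", "table_grounding",
                           "unit_consistency", "source_traceability"],
        "deep_insight": ["causal_reasoning", "trend_interpretation",
                         "risk_assessment", "actionability"],
    }

    def layer_scores(indicator_scores: dict[str, float]) -> dict[str, float]:
        """Aggregate per-indicator scores (e.g., in [0, 1]) into layer
        scores, assuming equal weights within each layer."""
        return {layer: mean(indicator_scores[name] for name in names)
                for layer, names in LAYERS.items()}

A hierarchical rollup of this kind is what lets the benchmark report a model as strong on deep insight while weak on semantic consistency, rather than collapsing both into one number.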

If this is right

  • Closed-source models become the preferred choice for core reasoning roles in multi-agent financial workflows.
  • Open-source general models require targeted fixes for semantic consistency before use in quality-sensitive generation tasks.
  • Financial-domain models need added auditing and correction training to move beyond moderate insight generation.
  • The three-layer structure allows clearer task routing among agents than single-score benchmarks.
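The routing implication in the last bullet can be pictured as a small policy over the three layer scores. The sketch below is a toy rule under assumed thresholds and role names, not the paper's deployment logic.

    def assign_role(layers: dict[str, float], threshold: float = 0.7) -> str:
        """Toy routing rule mapping a model's three layer scores to an
        agent role, mirroring the stratification the review describes."""
        sc = layers["semantic_consistency"]
        da = layers["data_alignment"]
        di = layers["deep_insight"]
        if min(sc, da, di) >= threshold:
            return "core_reasoning_agent"    # strong across the full pipeline
        if di >= threshold:
            return "insight_drafting_agent"  # drafts are audited downstream
        if sc >= threshold and da >= threshold:
            return "auditing_agent"          # reliable checks, weaker insight
        return "unassigned"                  # needs targeted fixes first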

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layered evaluation could transfer to research reporting in medicine or law by swapping domain data.
  • Repeated use of the benchmark on successive model versions would track whether auditing skills improve over time.
  • Pilot deployments already running suggest the rubric can serve as a pre-release filter for financial tools.

Load-bearing premise

The chosen split into semantic consistency, data alignment, and deep insight plus the 12-indicator rubric fully captures the skills required for reliable financial research reporting.

What would settle it

A controlled deployment in which models that scored high or low on FinReasoning produce real financial reports, and independent experts measure the rates of factual errors, numerical inconsistencies, and shallow analysis against ground-truth corporate data.
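One way to score such a settling experiment: correlate each model's FinReasoning score with the error rate experts count in its real reports; a strong negative rank correlation would support the benchmark's validity. A minimal sketch, assuming SciPy and one aggregate error rate per model:

    from scipy.stats import spearmanr

    def benchmark_validity(finreasoning_scores: list[float],
                           expert_error_rates: list[float]) -> tuple[float, float]:
        """Spearman rank correlation between per-model benchmark scores
        and expert-measured error rates (factual errors, numerical
        inconsistencies, and shallow-analysis flags per report)."""
        rho, p_value = spearmanr(finreasoning_scores, expert_error_rates)
        return rho, p_value  # validity is supported if rho is strongly negative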

Figures

Figures reproduced from arXiv:2603.19254 by Dawei Cheng, Jie Xu, Jinru Ding, Yidong Jiang, Yinsheng Yao, Yiyun Zhu, Ziwen Xu.

Figure 1: An overview of the taxonomy of the FinReasoning benchmark, organized into three complementary tracks.
Figure 2: An overview of the FinReasoning benchmark, presenting its data sources and the construction procedures for each task.
Figure 3: Comparative performance across model categories.
Figure 4: Visualized DI scores across tasks and model series.
Original abstract

Large language models (LLMs) are increasingly deployed in financial research workflows, where their role is evolving from single-model assistance for human analysts toward autonomous collaboration among multiple agents. Yet real-world deployments still expose factual errors, numerical inconsistencies, and shallow analysis, which can distort assessments of corporate fundamentals and trigger severe economic losses. While existing benchmarks have begun to evaluate such failures, they score all aspects of the generated analysis in one pass, failing to distinguish whether a model fails at foundational stages like auditing and correction, or underperforms at generating research-grade insights. Consequently, it obscures capability bottlenecks and the specialized strengths essential for multi-agent role assignment. To address these gaps, we introduce FinReasoning, a hierarchical benchmark that decomposes the core capabilities of financial research into semantic consistency, data alignment, and deep insight. We further propose a fine-grained evaluation framework that strengthens hallucination-correction assessment and incorporates a 12-indicator rubric for core analytical skills. FinReasoning reveals clear capability stratification across model types. Closed-source models (like Doubao-Seed-1.8) perform strongly overall and are better suited for core reasoning agents in multi-agent financial systems; open-source general models (like Qwen3-235B) show clear capability divergence and consistently underperform on Semantic Consistency, making them less suited for quality-sensitive generation tasks; financial-domain models (like Fin-R1) generate moderate insights but lack foundational auditing skills. Our work has already been deployed in pilot tests across several real-world scenarios. The resource is available at https://github.com/TongjiFinLab/FinReasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FinReasoning, a hierarchical benchmark that decomposes financial research reporting capabilities into semantic consistency, data alignment, and deep insight, evaluated with a 12-indicator rubric. It reports capability stratification across model types, with closed-source models performing strongly overall, open-source general models underperforming on semantic consistency, and financial-domain models lacking foundational auditing skills, and notes pilot deployments in real-world scenarios.

Significance. If the rubric and decomposition are validated, the benchmark could meaningfully advance evaluation of LLMs for financial workflows by identifying specific bottlenecks suitable for multi-agent role assignment, offering a more granular alternative to single-pass scoring in existing benchmarks.

major comments (2)
  1. [Evaluation Framework] Evaluation Framework (12-indicator rubric and three-way decomposition): The stratification claims depend on the rubric accurately capturing auditing and insight capabilities, but no inter-rater agreement (e.g., Fleiss' kappa), expert correlation, or ablation showing non-redundancy is reported; without this, model-type differences cannot be confidently attributed to capability gaps rather than rubric artifacts or annotator bias.
  2. [Results] Results and Analysis: The abstract states clear stratification (e.g., open-source models underperform on Semantic Consistency; financial models lack auditing skills), yet the manuscript supplies no quantitative tables, per-indicator scores, error analysis, or statistical tests supporting these differences; this is load-bearing for the central claim of capability bottlenecks.
minor comments (2)
  1. [Abstract] The claim of deployment in pilot tests is mentioned without any reported outcomes, metrics, or lessons learned that would strengthen the practical relevance.
  2. [Benchmark Design] Notation for the three capability dimensions could be made more consistent across sections to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation framework and results presentation. We address each major comment below and will revise the manuscript accordingly to strengthen the validation and quantitative support for our claims.

Point-by-point responses
  1. Referee: [Evaluation Framework] Evaluation Framework (12-indicator rubric and three-way decomposition): The stratification claims depend on the rubric accurately capturing auditing and insight capabilities, but no inter-rater agreement (e.g., Fleiss' kappa), expert correlation, or ablation showing non-redundancy is reported; without this, model-type differences cannot be confidently attributed to capability gaps rather than rubric artifacts or annotator bias.

    Authors: We agree that explicit validation metrics for the 12-indicator rubric are needed to support the attribution of model-type differences to capability gaps. In the revised manuscript we will add: (1) Fleiss' kappa computed over multiple expert annotators on a held-out subset of reports, (2) Pearson/Spearman correlations between rubric scores and independent expert holistic ratings, and (3) an ablation study that successively removes indicator groups to demonstrate non-redundancy. These additions will be placed in a new subsection of the evaluation framework and will directly address concerns about rubric artifacts or annotator bias. revision: yes

  2. Referee: [Results] Results and Analysis: The abstract states clear stratification (e.g., open-source models underperform on Semantic Consistency; financial models lack auditing skills), yet the manuscript supplies no quantitative tables, per-indicator scores, error analysis, or statistical tests supporting these differences; this is load-bearing for the central claim of capability bottlenecks.

    Authors: We acknowledge that the current results section lacks the granular quantitative evidence required to substantiate the stratification claims. In the revision we will expand the Results and Analysis section with: (1) a main table reporting all 12 indicator scores plus the three aggregate dimensions for every evaluated model, (2) per-indicator error analysis with representative failure cases, and (3) statistical tests (paired t-tests or Wilcoxon rank-sum with Bonferroni correction) quantifying the significance of differences between model categories. These tables and analyses will be added before the discussion of multi-agent implications. revision: yes
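The first response proposes Fleiss' kappa for inter-annotator agreement on the rubric. For concreteness, here is the standard computation over an items-by-categories count matrix; the tabulation format is an assumption about how the annotations would be collected.

    import numpy as np

    def fleiss_kappa(counts: np.ndarray) -> float:
        """Fleiss' kappa for a matrix of shape (n_items, n_categories),
        where counts[i, j] is how many annotators put item i in category j
        and each row sums to the same number of raters n."""
        n_items = counts.shape[0]
        n_raters = counts[0].sum()
        p_j = counts.sum(axis=0) / (n_items * n_raters)  # category prevalence
        p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
        p_bar = p_i.mean()                               # observed agreement
        p_e = np.square(p_j).sum()                       # chance agreement
        return (p_bar - p_e) / (1 - p_e)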
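The second response names Wilcoxon rank-sum tests with Bonferroni correction for the category comparisons. A minimal sketch of that analysis, assuming per-indicator score lists for two model categories and SciPy's ranksums:

    from scipy.stats import ranksums

    def stratification_tests(cat_a: dict[str, list[float]],
                             cat_b: dict[str, list[float]],
                             alpha: float = 0.05) -> dict[str, tuple[float, bool]]:
        """Per-indicator Wilcoxon rank-sum test between two model categories
        (e.g., closed-source vs. open-source general), with a Bonferroni-
        corrected significance threshold across all tested indicators."""
        corrected_alpha = alpha / len(cat_a)  # e.g., 0.05 / 12
        results = {}
        for indicator, scores_a in cat_a.items():
            _, p_value = ranksums(scores_a, cat_b[indicator])
            results[indicator] = (p_value, p_value < corrected_alpha)
        return results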

Circularity Check

0 steps flagged

No circularity: benchmark and rubric introduced independently without derivations or self-referential reductions

Full rationale

The paper introduces FinReasoning as a new hierarchical benchmark decomposing financial research capabilities into semantic consistency, data alignment, and deep insight, along with a 12-indicator rubric. No equations, fitted parameters, predictions, or derivations are present. The framework is presented as an independent evaluation tool applied to model outputs rather than derived from or reducing to those outputs by construction. Claims of capability stratification follow from rubric application and do not rest on any self-citation chain or ansatz. This matches the default expectation for benchmark papers with no mathematical derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions from NLP benchmarking literature but introduces new evaluation categories without stating additional free parameters, axioms, or invented entities in the abstract.

pith-pipeline@v0.9.0 · 5607 in / 1119 out tokens · 25277 ms · 2026-05-15T19:38:41.390604+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 9 internal anchors

  1. [1]

    Anthropic. 2025. System Card: Claude Opus 4 & Claude Sonnet 4. https://www.anthropic.com/claude-4-system-card

  2. [2]

    Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. 2025. Intern-s1: A scientific multimodal foundation model. arXiv preprint arXiv:2508.15763 (2025)

  3. [3]

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. 2025. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3639–3664

  4. [4]

    Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. 2025. HalluLens: LLM Hallucination Benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehv...

  5. [5]

    Noga BenYoash, Menachem Brief, Oded Ovadia, Gil Shenderovitz, Moshik Mishaeli, Rachel Lemberg, and Eitam Sheetrit. 2025. SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM 2), Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehr...

  6. [6]

    Justin Brake. 2025. Major N.L. healthcare report contains errors likely generated by A.I. https://theindependent.ca/news/lji/major-n-l-healthcare-report-contains-errors-likely-generated-by-a-i/. Accessed: 2026-02-11

  7. [7]

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Hua...

  8. [8]

    DeepSeek-V3 Technical Report. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  9. [9]

    Zican Dong, Tianyi Tang, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2086–2099

  10. [10]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv e-prints (2024), arXiv–2407

  11. [11]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  12. [12]

    Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alexander Wardle-Solano, Hannah Szabó, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander Fabbri, Wojciech Ma...

  13. [13]

    Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, and Chen Zhao. 2025. FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Associ...

  14. [14]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  15. [15]

    Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. 2023. FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944 [cs.CL]

  16. [16]

    Lanlan Ji, Dominic Seyler, Gunkirat Kaur, Manjunath Hegde, Koustuv Dasgupta, and Bing Xiang. 2025. PHANTOM: A Benchmark for Hallucination Detection in Financial Long-Context QA. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=5YQAo0S3Hm

  17. [17]

    Junzhe Jiang, Chang Yang, Aixin Cui, Sihan Jin, Ruiyu Wang, Bo Li, Xiao Huang, Dongning Sun, and Xinrun Wang. 2025. FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs. arXiv:2505.13533 [cs.AI] https://arxiv.org/abs/2505.13533

  18. [18]

    Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, and Horst Samulowitz. 2025. StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation. Proceedings of the VLDB Endowment. ISSN 2150 (2025), 8097

  19. [19]

    Viet Lai, Michael Krumdick, Charles Lovering, Varshini Reddy, Craig Schmidt, and Chris Tanner. 2025. SEC-QA: A Systematic Evaluation Corpus for Financial QA. In Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing, Chung-Chi Chen, Genta Indra Winata, Stephen Rawls, Anirban Das, Hsin-Hsi Chen, and Hiroya Takamura (Eds.)....

  20. [20]

    Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2024. The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar ...

  21. [21]

    Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, Dongpo Cheng, Ronghao Chen, Huacan Wang, Xingdong Feng, Huixia Judy Wang, Chengchun Shi, and Liwen Zhang. 2026. Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning. arXiv:2503.16252 [cs.CL] https://arxiv.org/abs/2503.16252

  22. [22]

    Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, An...

  23. [23]

    Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  24. [24]

    Nora Redmond. 2025. This bank is using AI versions of its analysts to meet clients' demand for videos. https://www.businessinsider.com/ai-videos-analysts-research-notes-ubs-swiss-bank-2025-5. Accessed: 2026-02-11

  25. [25]

    Abulhair Saparov and He He. 2023. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=qFVVBzXxR2V

  26. [26]

    ByteDance Seed, Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, et al. 2025. Seed1.5-Thinking: Advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914 (2025)

  27. [27]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al

  28. [28]

    OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267 (2025)

  29. [29]

    Jan Spörer. 2025. Can AI Read Like a Financial Analyst? A Financial Touchstone for Frontier Language Models Such as Gemini 2.5 Pro, o3, and Grok 4 on Long-Context Annual Report Comprehension. Association for Computing Machinery, New York, NY, USA, 291–298. https://doi.org/10.1145/3768292.3770417

  30. [30]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)

  31. [31]

    Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong Ga...

  32. [32]

    Milana Vinn. 2025. Citadel debuts new AI tool for equities investors, CTO Subramanian says. https://www.reuters.com/business/citadel-debuts-new-ai-tool-equities-investors-cto-subramanian-says-2025-12-03. Accessed: 2026-02-11

  33. [33]

    Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Jimin Huang, Guojun Xiong, et al. 2025. Finauditing: A financial taxonomy-structured multi-document benchmark for evaluating llms. arXiv preprint arXiv:2510.08886 (2025)

  34. [34]

    Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, Wentao Ye, Junbo Zhao, and Haobo Wang. 2025. RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis. In Findings of the Association for Computational Linguistics: ACL 2025, Wanxiang C...

  35. [35]

    Xiaojun Wu, Junxi Liu, Huan-Yi Su, Zhouchi Lin, Yiyan Qi, Chengjin Xu, Jiajun Su, Jiajie Zhong, Fuwei Wang, Saizhuo Wang, Fengrui Hua, Jia Li, and Jian Guo

  36. [36]

    Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, Suzhou, China, 22544–22560. https://doi.org/10.186...

  37. [37]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  38. [38]

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. 2025. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471 (2025)

  39. [39]

    Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, and Ke-wei Huang. 2025. FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in Finance. In Proceedings of the 6th ACM International Conference on AI in Finance. 159–167

  40. [40]

    Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang, and Tat-Seng Chua. 2022. Towards complex document understanding by discrete reasoning. In Proceedings of the 30th ACM International Conference on Multimedia. 4857–4866

  41. [41]

    Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural...

  42. [42]

    Jie Zhu, Qian Chen, Huaixia Dou, Junhui Li, Lifan Guo, Feng Chen, and Chi Zhang. 2025. Dianjin-r1: Evaluating and enhancing financial reasoning in large language models. arXiv preprint arXiv:2504.15716 (2025)