EvidenceLens: A Claim-Evidence Matrix for Auditing Financial Question Answering

Ángel F. García-Fernández; Angelos Stefanidis; Fengchen Gu; Huakang Li; Jionglong Su; Mian Zhou; Xiaotian Ren; Zhengyong Jiang; Zhilu Zhang

arXiv: 2606.23724 · v1 · pith:JO4QNPDXnew · submitted 2026-06-19 · cs.IR · cs.CL · cs.HC


Pith reviewed 2026-06-26 13:21 UTC · model grok-4.3

classification cs.IR · cs.CL · cs.HC

keywords financial question answering · claim-evidence alignment · LLM auditing · visual analytics · multimodal matrix · atomic claim decomposition · report verification · support gap detection


The pith

EvidenceLens turns LLM financial answers into a claim-evidence matrix that shows which parts rest on text, tables, or charts and which do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvidenceLens as a visual analytics system that reframes financial question answering as a claim-evidence alignment task. It decomposes model outputs into atomic claims, aligns them with passages, table cells, and chart regions, and renders the alignments in a multimodal matrix. The matrix is intended to make coverage gaps, contradictions, and modality imbalances visible at a glance. The system also supplies a JSON artifact schema, an alignment pipeline, and a review-priority ranking to support reproducible audits. Representative scenarios are used to illustrate how the approach separates grounded statements from unsupported synthesis that flat chat interfaces obscure.

Core claim

EvidenceLens treats financial question answering as a claim-evidence alignment problem whose central visual object is a multimodal claim-evidence matrix; the matrix coordinates atomic claims with their supporting or contradicting sources across narrative text, tables, and charts so that analysts can immediately see support composition, confidence levels, and coverage gaps.

What carries the argument

The multimodal claim-evidence matrix that maps each atomic claim to source passages, table cells, and chart regions while summarizing support composition and modality balance.
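As a rough illustration of what such a matrix holds, the sketch below stores one row per atomic claim and one cell per linked evidence item, then derives a per-claim support composition. The `Link` fields, the `"support"`/`"contradict"` labels, and the evidence identifiers are all hypothetical; the paper's actual data model is not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass

MODALITIES = ("text", "table", "chart")

@dataclass(frozen=True)
class Link:
    # Hypothetical alignment record; field names are assumptions.
    claim_id: str
    evidence_id: str
    modality: str   # one of MODALITIES
    relation: str   # "support" or "contradict"

def build_matrix(claim_ids, links):
    """Return {claim_id: {evidence_id: Link}}; unlinked claims keep an empty row."""
    matrix = {cid: {} for cid in claim_ids}
    for link in links:
        matrix[link.claim_id][link.evidence_id] = link
    return matrix

def summarize_row(row):
    """Support count per modality, plus gap and contradiction flags for one claim."""
    support = Counter(
        link.modality for link in row.values() if link.relation == "support"
    )
    return {
        "support_by_modality": {m: support.get(m, 0) for m in MODALITIES},
        "has_gap": sum(support.values()) == 0,
        "has_contradiction": any(
            link.relation == "contradict" for link in row.values()
        ),
    }

# Toy input, invented for illustration.
claims = ["c1", "c2", "c3"]
links = [
    Link("c1", "p4", "text", "support"),
    Link("c1", "t2.r3.c1", "table", "support"),
    Link("c2", "fig1.bar3", "chart", "contradict"),
]
matrix = build_matrix(claims, links)
summary = {cid: summarize_row(row) for cid, row in matrix.items()}
```

In this toy input, `c3` has an empty row and `c2` has only a contradicting chart region, so both are flagged as coverage gaps; a rendered matrix would show the same thing as empty or contradiction-only rows.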

If this is right

Analysts can separate directly grounded claims from overconfident synthesis in earnings reports and analyst notes.
Coverage, contradiction, and modality imbalance become visible without manual cross-referencing.
A JSON-based artifact schema and deterministic review ranking make the auditing process reproducible and auditable.
The same decomposition and alignment steps apply across narrative text, tables, and charts in a single view.
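To make the reproducibility point concrete, here is a minimal sketch of what a JSON audit artifact and a deterministic review ranking might look like. The schema keys, the scoring rule, and the sample values are assumptions made for illustration; they are not the schema or ranking function specified in the paper.

```python
import json

# Hypothetical artifact; keys and values are invented for illustration.
ARTIFACT_JSON = """
{
  "question": "How did revenue change year over year?",
  "claims": [
    {"id": "c1", "text": "Revenue rose 12%.", "confidence": 0.9,
     "evidence": [{"id": "t2.r3", "modality": "table", "relation": "support"}]},
    {"id": "c2", "text": "Growth was driven by services.", "confidence": 0.8,
     "evidence": []},
    {"id": "c3", "text": "Margins were flat.", "confidence": 0.6,
     "evidence": [{"id": "fig1", "modality": "chart", "relation": "contradict"}]}
  ]
}
"""

def review_priority(claim):
    """Assumed scoring rule: higher means the claim should be reviewed sooner."""
    supports = sum(1 for e in claim["evidence"] if e["relation"] == "support")
    contradicted = any(e["relation"] == "contradict" for e in claim["evidence"])
    # Stated confidence with no supporting evidence counts as a gap;
    # any contradiction adds a fixed penalty.
    gap = claim["confidence"] if supports == 0 else 0.0
    return gap + (1.0 if contradicted else 0.0)

def rank_for_review(artifact):
    # Sort by descending score and break ties on claim id, so the same
    # artifact always yields the same order.
    return sorted(artifact["claims"], key=lambda c: (-review_priority(c), c["id"]))

artifact = json.loads(ARTIFACT_JSON)
ranked_ids = [c["id"] for c in rank_for_review(artifact)]
```

The explicit tie-break on `id` is what makes the ordering deterministic in this sketch: two claims with equal scores cannot swap places between runs.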

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The matrix format could be adapted to audit LLM outputs in domains that also combine text, tables, and figures, such as regulatory filings or scientific papers.
If the alignment pipeline is made fully automatic, the system might serve as a backend for real-time verification rather than post-hoc review.
Extending the matrix to track claim-level confidence scores over time could surface how model answers drift when new reports are added.

Load-bearing premise

Atomic claim decomposition and multimodal alignment can be performed reliably enough that the resulting matrix directly reveals coverage, contradiction, and imbalance without adding its own errors.

What would settle it

A controlled audit task in which analysts using the matrix flag fewer false positives or miss fewer unsupported claims than analysts using only the original LLM answer and source documents.
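A sketch of how the two error counts in such a study could be scored per analyst follows. The function name and all sample values are hypothetical; the numbers are invented to exercise the code and are not results from the paper.

```python
def audit_errors(flagged, unsupported):
    """Count false positives and misses for one analyst's set of flagged claims."""
    flagged, unsupported = set(flagged), set(unsupported)
    return {
        # Flagged as unsupported although the claim is actually grounded.
        "false_positives": len(flagged - unsupported),
        # Truly unsupported claims the analyst did not flag.
        "missed": len(unsupported - flagged),
    }

# Toy ground truth and flags, invented for illustration only.
unsupported = {"c2", "c5"}
baseline_flags = {"c1", "c2"}   # analyst with answer and sources only
matrix_flags = {"c2", "c5"}     # analyst with the matrix view

baseline = audit_errors(baseline_flags, unsupported)
with_matrix = audit_errors(matrix_flags, unsupported)
```

A real comparison would aggregate these counts over many analysts and questions and test whether the difference between conditions is reliable.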

Figures

Figures reproduced from arXiv: 2606.23724 by Ángel F. García-Fernández, Angelos Stefanidis, Fengchen Gu, Huakang Li, Jionglong Su, Mian Zhou, Xiaotian Ren, Zhengyong Jiang, Zhilu Zhang.

**Figure 1.** EVIDENCELENS converts a generated financial answer into an auditable claim-evidence representation. (A) The Claim Panel decomposes the answer into atomic claims and summarizes support composition, confidence–support gaps, and review priority. (B) The central Claim-Evidence Matrix groups evidence columns by modality and source order, making sparse support, cross-modal corroboration, and contradiction visible at…

Original abstract

Large language models are increasingly used to answer questions over annual reports, earnings decks, and analyst notes, yet their outputs remain difficult to verify in high-stakes financial workflows. A fluent answer can blend directly grounded statements, weak synthesis, and unsupported claims across narrative text, tables, and charts. We present EvidenceLens, a visual analytics prototype that treats financial question answering as a claim-evidence alignment problem. The system decomposes an answer into atomic claims, summarizes support composition and confidence–support gaps, and coordinates claim-level inspection with source passages, table cells, and chart regions. Its core visual representation is a multimodal claim-evidence matrix that makes coverage, contradiction, and modality imbalance immediately visible. To support reproducibility, we also specify a JSON-based artifact schema, a lightweight multimodal alignment pipeline, and a deterministic review-priority ranking that maps backend signals into an auditable visual structure. Through representative report-auditing scenarios, we show how EvidenceLens helps analysts distinguish grounded claims from overconfident synthesis that conventional chat interfaces flatten.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EvidenceLens is a clean prototype for claim-evidence matrices in financial QA auditing that makes support gaps visible, but it offers only scenarios with no measurements or comparisons.


The paper's main contribution is a multimodal claim-evidence matrix plus a JSON artifact schema that decomposes LLM answers into atomic claims, aligns them to text passages, table cells, and chart regions, and surfaces coverage, contradictions, and modality imbalance in one view. The deterministic review-priority ranking and lightweight alignment pipeline are specified clearly enough for others to reproduce the artifact structure.

This design does address a practical issue: conventional chat outputs flatten the difference between directly supported statements and overconfident synthesis, and the matrix layout makes that distinction quicker to scan. The scenarios walk through report-auditing cases in a way that shows the intended workflow.

The limitation is straightforward. All claims about reduced verification effort rest on representative examples; there are no user studies, no timing data, no error rates on claim decomposition or alignment, and no comparison against simpler baselines like sentence highlighting or existing fact-checking interfaces. The assumption that atomic claims can be extracted and aligned reliably enough to avoid introducing new errors is stated but not checked.

The work is aimed at teams building audit tools for financial or other high-stakes LLM applications. A reader who needs concrete design patterns for claim-level visualization will get usable ideas; someone looking for measured improvements will not.

I would send it to peer review with the expectation that the authors add at least a small evaluation or ablation, because the core framing is sound even if the current support is thin.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EvidenceLens, a visual analytics prototype for auditing LLM answers to financial questions over reports and related documents. It decomposes generated answers into atomic claims, aligns them with multimodal evidence (text, tables, charts) via a specified lightweight pipeline and JSON artifact schema, and visualizes alignments in a claim-evidence matrix to expose coverage gaps, contradictions, and modality imbalances. The utility is demonstrated through representative report-auditing scenarios rather than quantitative experiments.

Significance. If the decomposition and alignment steps prove reliable in practice, the matrix visualization and auditable ranking could meaningfully improve verification workflows in high-stakes financial QA by surfacing evidence composition that chat interfaces obscure. The explicit JSON schema and deterministic review-priority ranking are concrete strengths that support reproducibility and extension by others.

major comments (2)

[Abstract] The central claim that EvidenceLens 'helps analysts distinguish grounded claims from overconfident synthesis' rests entirely on illustrative scenarios; no quantitative metrics (e.g., alignment precision, inter-annotator agreement on claim decomposition, or task-completion time with/without the tool) or baseline comparisons are reported, leaving the reliability of the multimodal alignment pipeline unmeasured.
[Pipeline description (inferred from abstract and system overview)] The description of the 'lightweight multimodal alignment pipeline': The paper assumes atomic claim decomposition and evidence alignment can be performed reliably enough to reveal coverage/contradiction without substantial manual correction, yet provides no error analysis, failure modes, or handling strategy for ambiguous cases (e.g., chart regions or synthesized claims), which is load-bearing for the matrix's claimed immediate visibility of issues.
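Two of the metrics the referee asks for can be sketched in a few lines. The code below computes alignment precision against a gold set of (claim, evidence) links and Cohen's kappa between two annotators. It assumes, as a simplification, that annotator agreement is measured on a per-claim label such as supported versus unsupported, rather than on the claim segmentation itself; the function names are illustrative.

```python
from collections import Counter

def alignment_precision(predicted, gold):
    """Fraction of predicted (claim, evidence) links that appear in the gold set."""
    predicted, gold = set(predicted), set(gold)
    return len(predicted & gold) / len(predicted) if predicted else 0.0

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Precision alone says nothing about evidence the pipeline failed to link, so a fuller evaluation would also report recall against the same gold set.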

minor comments (1)

[Abstract] The abstract and system overview would benefit from an explicit limitations paragraph stating the scope of the scenario-based demonstration and the current maturity of the alignment pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive assessment of the JSON schema and ranking mechanism. We address the major comments point by point below.


Referee: [Abstract] The central claim that EvidenceLens 'helps analysts distinguish grounded claims from overconfident synthesis' rests entirely on illustrative scenarios; no quantitative metrics (e.g., alignment precision, inter-annotator agreement on claim decomposition, or task-completion time with/without the tool) or baseline comparisons are reported, leaving the reliability of the multimodal alignment pipeline unmeasured.

Authors: The manuscript is framed as a visual analytics prototype paper, with utility demonstrated through representative scenarios rather than controlled experiments or quantitative benchmarks. This is consistent with many system and design papers in the visual analytics community. The central claim is supported by the scenarios showing how the matrix exposes issues that chat interfaces obscure. We will revise the abstract to clarify that the distinction is illustrated via scenarios and include an explicit limitations paragraph noting the lack of quantitative evaluation of the pipeline.

Revision: partial

Referee: The description of the 'lightweight multimodal alignment pipeline': The paper assumes atomic claim decomposition and evidence alignment can be performed reliably enough to reveal coverage/contradiction without substantial manual correction, yet provides no error analysis, failure modes, or handling strategy for ambiguous cases (e.g., chart regions or synthesized claims), which is load-bearing for the matrix's claimed immediate visibility of issues.

Authors: We agree that additional discussion of the pipeline's assumptions and limitations would be beneficial. The design intent is that the matrix and inspection views enable analysts to detect and address alignment issues, rather than assuming perfect automation. We will expand the pipeline section to include a discussion of potential failure modes for ambiguous cases such as chart interpretations and synthesized claims, along with the strategy of human-in-the-loop verification.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper is a system description of a visual analytics prototype (EvidenceLens) for claim-evidence alignment in financial QA. It contains no equations, derivations, fitted parameters, or mathematical claims. The core contributions are an engineering artifact (JSON schema, alignment pipeline, matrix visualization) and illustrative scenarios; the argument does not reduce any result to its own inputs by construction or via self-citation chains. All load-bearing elements are scoped as descriptive rather than predictive or theorem-based.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is a system description with no mathematical model, fitted parameters, or new entities postulated. No free parameters, axioms, or invented entities are required or introduced.

pith-pipeline@v0.9.1-grok · 5749 in / 1152 out tokens · 11869 ms · 2026-06-26T13:21:27.466194+00:00 · methodology


