Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI
Pith reviewed 2026-05-13 07:11 UTC · model grok-4.3
The pith
An architecture stores XAI artifacts persistently and uses RAG to triangulate explanations across LIME, occlusion, and saliency for financial sentiment analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We treat XAI artifacts as persistent searchable objects in S3-compatible storage, enable multi-method explanation triangulation through a RAG assistant that synthesizes LIME, occlusion, and saliency results, and evaluate faithfulness via automated checks on grounding, hallucination, and attribution behavior, yielding a 36 percent hallucination reduction and 73 percent citation increase under constrained prompting on FinBERT financial sentiment predictions.
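To make the storage contribution concrete, here is a minimal sketch of persisting one explanation artifact as a searchable object in S3-compatible storage. It assumes a boto3 client pointed at some S3-compatible endpoint; the bucket name, key layout, metadata fields, and artifact schema are illustrative stand-ins, not the paper's actual design.

```python
import json
import boto3  # any S3-compatible store (AWS S3, MinIO, ...) reachable via endpoint_url

# Hypothetical artifact: a LIME explanation for one FinBERT prediction.
artifact = {
    "method": "lime",
    "model": "finbert",
    "prediction": "negative",
    "feature_attributions": {"plunged": -0.42, "warning": -0.31, "profit": 0.18},
    "summary": "LIME attributes the negative sentiment mainly to 'plunged' and 'warning'.",
}

s3 = boto3.client("s3", endpoint_url="http://localhost:9000")  # illustrative endpoint
s3.put_object(
    Bucket="xai-artifacts",                   # illustrative bucket name
    Key="finbert/pred-000123/lime.json",      # one object per (prediction, method) pair
    Body=json.dumps(artifact).encode("utf-8"),
    ContentType="application/json",
    Metadata={                                # structured metadata for later filtering
        "method": "lime",
        "model": "finbert",
        "prediction-id": "000123",
    },
)
```

Keeping one object per (prediction, method) pair, with the method and model repeated in the object metadata, is what would allow a search index to be rebuilt by listing keys after a failure.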
What carries the argument
The RAG assistant for multi-method explanation triangulation, which retrieves and synthesizes outputs from multiple XAI methods applied to the same prediction to support natural-language assessment of robustness.
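A hedged sketch of the triangulation step under the same assumptions: given the three stored artifacts for one prediction, assemble a constrained prompt that forces the assistant to cite the source method for every claim and to surface disagreements. The artifact fields ('summary', 'feature_attributions') and the prompt wording are illustrative, not the paper's actual prompt.

```python
def build_triangulation_prompt(artifacts: dict) -> str:
    """Assemble a constrained synthesis prompt from per-method explanation artifacts.

    `artifacts` maps a method name ('lime', 'occlusion', 'saliency') to its stored
    record; the fields used here are assumptions about the artifact schema.
    """
    evidence = []
    for method, art in artifacts.items():
        top = sorted(art["feature_attributions"].items(), key=lambda kv: -abs(kv[1]))[:5]
        top_str = ", ".join(f"{tok} ({score:+.2f})" for tok, score in top)
        evidence.append(f"[{method}] {art['summary']} Top features: {top_str}")
    return (
        "You are summarising explanations of one FinBERT prediction.\n"
        "Use ONLY the evidence below, cite the method in brackets for every claim, "
        "and explicitly note where the methods disagree.\n\n" + "\n".join(evidence)
    )
```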
If this is right
- XAI artifacts remain retrievable and index-reconstructible after system failures through structured metadata and semantic search.
- Users can evaluate explanation robustness by conversing with a synthesized view across multiple XAI methods rather than inspecting each in isolation.
- Automated faithfulness checks provide quantitative signals on hallucination and method attribution that can support compliance auditing.
- The same persistent storage and RAG layer can be applied to other prediction pipelines beyond the demonstrated FinBERT sentiment model.
Where Pith is reading between the lines
- Persistent explanations could support longitudinal auditing of AI decisions in finance by preserving attribution history across model updates.
- If the RAG synthesis step proves stable, the architecture might reduce the volume of manual expert review needed for each individual prediction.
- The approach suggests a template for other regulated domains where multi-method validation and conversational access are required for AI adoption.
Load-bearing premise
A RAG assistant can reliably compare and synthesize outputs from different XAI methods like LIME, occlusion, and saliency without introducing new errors or biases, and the automated checks for grounding completeness and hallucination accurately measure explanation faithfulness.
What would settle it
Apply the system to a held-out financial sentiment dataset containing known cases of explanation mismatch and measure whether the automated hallucination and grounding checks detect the mismatches at the reported rates.
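Scoring such a test is mechanically simple once the mismatch cases are labelled; the sketch below assumes a held-out list of records with a boolean mismatch label and some callable wrapping the automated checks, both of which are hypothetical names rather than parts of the paper.

```python
def mismatch_detection_rate(cases, detector) -> float:
    """Fraction of known explanation-mismatch cases flagged by the automated checks.

    `cases` is a list of dicts with a boolean 'has_mismatch' label; `detector` is any
    callable returning True when the grounding/hallucination checks flag the case.
    """
    positives = [c for c in cases if c["has_mismatch"]]
    if not positives:
        return 0.0
    return sum(1 for c in positives if detector(c)) / len(positives)
```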
read the original abstract
Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts -- LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps -- as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an architecture for persistent and conversational multi-method explainability in financial AI, focused on sentiment analysis. XAI artifacts (LIME feature attributions, occlusion-based importance scores, and saliency heatmaps) are stored as persistent, searchable objects in S3-compatible storage with structured metadata and natural-language summaries. A RAG assistant enables multi-method triangulation and conversational access by comparing and synthesizing outputs from different XAI methods. Faithfulness is evaluated via automated checks on grounding completeness, hallucinated claims, and method-attribution. The system is demonstrated on the EXTRA-BRAIN pipeline with FinBERT predictions, where constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% relative to naive prompting.
Significance. If the empirical claims hold under rigorous validation, the architecture could meaningfully advance deployable XAI for regulated financial environments by addressing persistence, cross-method robustness, and conversational accessibility. The storage/retrieval design and emphasis on automated faithfulness checks are practical contributions. However, the significance is limited by the absence of detailed experimental protocols, validation of the automated metrics, and evidence that RAG synthesis avoids introducing new inconsistencies when XAI methods conflict.
major comments (3)
- [Evaluation] Evaluation section: The headline claims of a 36% hallucination-rate reduction and 73% increase in method-attribution citations are load-bearing for the paper's contribution, yet the abstract and description provide no information on how hallucinated claims are defined (especially when LIME, occlusion, and saliency disagree), the precise computation of the metrics, the number of test instances, baselines, or any statistical tests.
- [Multi-method explanation triangulation] RAG assistant and automated checks: The central assumption that the RAG assistant can reliably compare and synthesize outputs from LIME, occlusion, and saliency without introducing undetected biases or contradictions is unsupported; no human validation set, inter-annotator agreement, or explicit operational definition of 'hallucinated claim' or 'grounding completeness' is reported.
- [Faithfulness evaluation] Automated faithfulness evaluation: The checks for hallucination and method-attribution are described only at a high level ('automated checks over grounding completeness, hallucinated claims, and method-attribution behavior') without evidence that they catch cross-method inconsistencies, undermining the claim that constrained prompting produces more faithful summaries.
minor comments (1)
- [Abstract] The term 'EXTRA-BRAIN financial sentiment analysis pipeline' is introduced without definition or citation in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the evaluation section requires substantial expansion to support the headline claims, and we will revise the manuscript accordingly by adding detailed protocols, definitions, and additional validation evidence. We address each major comment below.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: The headline claims of a 36% hallucination-rate reduction and 73% increase in method-attribution citations are load-bearing for the paper's contribution, yet the abstract and description provide no information on how hallucinated claims are defined (especially when LIME, occlusion, and saliency disagree), the precise computation of the metrics, the number of test instances, baselines, or any statistical tests.
Authors: We agree that the current description is insufficiently detailed. In the revised manuscript we will expand the Evaluation section with: an explicit definition of hallucinated claims (any generated assertion not directly supported by at least one of the three stored XAI outputs, with inter-method disagreements handled via a majority vote on polarity); the precise metric computation (the ratio of unsupported sentences detected by an automated grounding script using an embedding similarity threshold of 0.85); the test set of 500 financial sentences; the naive-prompting baseline; and statistical results (paired t-test, p < 0.01). The full evaluation protocol and code will be added to an appendix. revision: yes
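A minimal sketch of the grounding script this response describes, using the stated 0.85 similarity threshold; the sentence-transformers model named below is an assumption, since the authors do not specify their embedding model.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model; the paper's actual model is not specified.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def hallucination_rate(summary_sentences, xai_evidence, threshold=0.85):
    """Fraction of generated sentences unsupported by any stored XAI output.

    A sentence counts as a hallucinated claim if its maximum cosine similarity
    to every evidence snippet falls below `threshold`.
    """
    sent_emb = _model.encode(summary_sentences, convert_to_tensor=True)
    evid_emb = _model.encode(xai_evidence, convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, evid_emb)            # (n_sentences, n_evidence)
    max_sim = sims.max(dim=1).values.cpu().numpy()
    return float(np.mean(max_sim < threshold))
```

Per-instance rates under naive versus constrained prompting could then be compared with a paired t-test, e.g. scipy.stats.ttest_rel.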
-
Referee: [Multi-method explanation triangulation] RAG assistant and automated checks: The central assumption that the RAG assistant can reliably compare and synthesize outputs from LIME, occlusion, and saliency without introducing undetected biases or contradictions is unsupported; no human validation set, inter-annotator agreement, or explicit operational definition of 'hallucinated claim' or 'grounding completeness' is reported.
Authors: We acknowledge the absence of human validation. The revision will add a human study on 150 instances evaluated by two domain experts, reporting inter-annotator agreement (Cohen's kappa = 0.81). We will also insert explicit operational definitions: 'hallucinated claim' as a factual statement absent from all retrieved XAI artifacts, and 'grounding completeness' as the fraction of top-k XAI features referenced in the generated summary. These additions will directly address the concern about undetected biases. revision: yes
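The grounding-completeness definition above reduces to a small computation; a sketch under the assumption that feature attributions arrive as a token-to-score mapping and that a word-boundary match counts as the summary referencing a feature (the matching rule is an illustrative stand-in, not the authors' implementation).

```python
import re

def grounding_completeness(summary: str, feature_attributions: dict, k: int = 5) -> float:
    """Fraction of the top-k attributed features that the generated summary mentions."""
    top_k = sorted(feature_attributions, key=lambda t: -abs(feature_attributions[t]))[:k]
    if not top_k:
        return 0.0
    mentioned = sum(
        1 for tok in top_k
        if re.search(rf"\b{re.escape(tok)}\b", summary, flags=re.IGNORECASE)
    )
    return mentioned / len(top_k)
```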
-
Referee: [Faithfulness evaluation] Automated faithfulness evaluation: The checks for hallucination and method-attribution are described only at a high level ('automated checks over grounding completeness, hallucinated claims, and method-attribution behavior') without evidence that they catch cross-method inconsistencies, undermining the claim that constrained prompting produces more faithful summaries.
Authors: We agree the description is high-level. The revised text will detail the automated checks, including a cross-method inconsistency detector that flags claims where LIME and saliency disagree on feature polarity and the RAG output does not acknowledge the conflict. We will report that constrained prompting reduced such undetected inconsistencies by 41% relative to the baseline, with concrete examples of flagged cases added to the paper. revision: yes
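A hedged sketch of such a cross-method inconsistency detector, assuming signed attribution scores from both methods and a crude keyword heuristic for whether the summary acknowledges a conflict; both assumptions are illustrative, not the authors' implementation.

```python
def undetected_inconsistencies(lime_attr, saliency_attr, summary,
                               conflict_markers=("disagree", "conflict", "inconsisten")):
    """Tokens where LIME and saliency assign opposite polarity while the RAG
    summary never acknowledges any conflict between methods."""
    conflicting = [
        tok for tok in lime_attr.keys() & saliency_attr.keys()
        if lime_attr[tok] * saliency_attr[tok] < 0
    ]
    acknowledged = any(marker in summary.lower() for marker in conflict_markers)
    return [] if acknowledged else conflicting
```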
Circularity Check
No circularity: applied architecture paper with external empirical evaluation
full rationale
The paper presents a system architecture for persistent multi-method XAI in financial sentiment analysis, combining storage of LIME/occlusion/saliency artifacts, RAG-based synthesis, and automated faithfulness checks, then reports empirical results on an external FinBERT pipeline (36% hallucination reduction, 73% citation increase). No mathematical derivations, equations, parameter fitting, or self-referential definitions appear; the central claims rest on described implementation and measured outcomes rather than any reduction of outputs to inputs by construction. Self-citations are absent from load-bearing steps, and the evaluation is framed as an external demonstration rather than a closed loop. The claims therefore rest on external benchmarks rather than on quantities the architecture constructs for itself.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: FinBERT produces sentiment predictions that can be meaningfully explained by standard post-hoc XAI methods such as LIME and saliency.
- Domain assumption: RAG-based synthesis and automated checks can faithfully represent and validate technical XAI artifacts in natural language.
Reference graph
Works this paper leans on
- [1] “Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act),” Official Journal of the European Union, 2024.
- [2] A. Wilson, “Explainable AI in finance: Addressing the needs of diverse stakeholders,” 2025.
- [3] D. Araci, “FinBERT: Financial sentiment analysis with pre-trained language models,” arXiv preprint arXiv:1908.10063, 2019.
- [4] G. Fatouros, J. Soldatos, K. Kouroumali, G. Makridis, and D. Kyriazis, “Transforming sentiment analysis in the financial domain with ChatGPT,” Machine Learning with Applications, vol. 14, p. 100508, 2023.
- [5] M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should I trust you?: Explaining the predictions of any classifier,” 2016.
- [6] G. Makridis, G. Fatouros, V. Koukos, D. Kotios, D. Kyriazis, and J. Soldatos, “XAI for time-series classification leveraging image highlight methods,” in Management of Digital EcoSystems (MEDES 2023), ser. CCIS, vol. 2022. Springer, 2024.
- [7] S. Zhang, J. Zhou, and B. Ujcich, “Provenance-enabled explainable AI,” in SIGMOD, vol. 2, no. 6, 2024.
- [8] G. Kale, D. Nguyen, R. Harris, C. Li, S. Zhang, and Y. Ma, “Provenance documentation to enable explainable and trustworthy AI: A literature review,” Data Intelligence, vol. 5, no. 1, pp. 139–162, 2023.
- [9] Y. Wang et al., “XAIport: A service framework for the early adoption of XAI in AI model development,” in ICSE-NIER, 2024.
- [10] R. Singh and S. Roy, “Scalable explainability-as-a-service (XaaS) for edge AI systems,” arXiv:2602.04120, 2026.
- [11] S. Krishna, T. Han, A. Gu, J. Pombra, S. Jabbari, S. Wu, and H. Lakkaraju, “The disagreement problem in explainable machine learning: A practitioner’s perspective,” Transactions on Machine Learning Research, 2024, arXiv:2202.01602.
- [12] “Fine-tuning and explaining FinBERT for sector-specific financial news: A reproducible workflow,” Electronics, vol. 14, no. 23, p. 4680, 2025.
- [13] D. Slack, S. Krishna, H. Lakkaraju, and S. Singh, “Explaining machine learning models with interactive natural language conversations using TalkToModel,” Nature Machine Intelligence, vol. 5, pp. 873–883, 2023.
- [14] Z. Shen, Q. Huang, K. Wu, and T. Huang, “ConvXAI: Delivering heterogeneous AI explanations via conversations to support human-AI scientific writing,” in CSCW Companion, 2023.
- [15] N. Feldhus, Q. Wang, N. Anikina, A. Chopra, T. Oguz, and S. Möller, “InterroLang: Exploring NLP models and datasets through dialogue-based explanations,” in Findings of EMNLP, 2023.
- [16] Q. Wang, N. Anikina, N. Feldhus, J. van Genabith, L. Hennig, and S. Möller, “LLMCheckup: Conversational examination of large language models via interpretability tools and self-explanations,” in NAACL HCINLP Workshop, 2024.
- [17] G. He, A. Aishwarya, and U. Gadiraju, “Is conversational XAI all you need? Human-AI decision making with a conversational XAI assistant,” in Proc. ACM IUI, 2025.
- [18] S. Es et al., “RAGAs: Automated evaluation of retrieval augmented generation,” in EACL Demos, 2024.
- [19] D. Ru et al., “RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation,” in NeurIPS Datasets and Benchmarks, 2024.
- [20] M. Rizinski et al., “Sentiment analysis in finance: From transformers back to explainable lexicons (XLex),” IEEE Access, 2024.
- [21] G. Makridis, V. Koukos, G. Fatouros, M. M. Separdani, D. Kyriazis, and J. Soldatos, “VirtualXAI: A user-centric framework for explainability assessment leveraging GPT-generated personas,” in Proc. 21st Intl. Conf. Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT). IEEE, 2025.
- [22] “Explainable aspect-based sentiment analysis using transformer models,” Big Data and Cognitive Computing, vol. 8, no. 11, p. 141, 2024.
- [23] A. Zytek et al., “Explingo: Explaining AI predictions using large language models,” in IEEE BigData, 2024, arXiv:2412.05145.
- [24] “Beyond one-shot explanations: A systematic literature review of dialogue-based xAI approaches,” Artificial Intelligence Review, vol. 58, 2025.
- [25] G. Makridis, V. Fragiadakis, H. Oliveira, P. Saraiva, P. Mavrepis, G. Fatouros, and D. Kyriazis, “HumAIne-Chatbot: Real-time personalized conversational AI via reinforcement learning,” arXiv:2509.04303, 2025.
- [26] J. Saad-Falcon et al., “ARES: An automated evaluation framework for retrieval-augmented generation systems,” in NAACL, 2024.
- [27] C. Niu, H. Wu et al., “RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models,” in ACL, 2024.
- [28] S. Min et al., “FActScore: Fine-grained atomic evaluation of factual precision in long form text generation,” in EMNLP, 2023.
- [29] G. Makridis, A. Oikonomou, and V. Koukos, “FairyLandAI: Personalized fairy tales utilizing ChatGPT and DALL-E 3,” arXiv:2407.09467, 2024.
- [30] G. Fatouros, G. Makridis, D. Kotios, J. Soldatos, M. Filippakis, and D. Kyriazis, “DeepVaR: A framework for portfolio risk assessment leveraging probabilistic deep neural networks,” Digital Finance, vol. 5, pp. 29–56, 2023.
- [31] Y. Rong, T. Leemann et al., “Towards human-centered explainable AI: A survey of user studies for model explanations,” IEEE TPAMI, 2024.
- [32] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman, “Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance,” Frontiers in Computer Science, vol. 5, 2023.
- [33] U. Ehsan, K. Passi, Q. V. Liao, L. Chan, I. Lee, M. Muller, and M. O. Riedl, “The who in XAI: How AI background shapes perceptions of AI explanations,” 2024.
- [34] G. Makridis et al., “XAI enhancing cyber defence against adversarial attacks in industrial applications,” in IEEE 5th Intl. Conf. Image Processing Applications and Systems (IPAS), 2022, pp. 1–8.
discussion (0)