Shapley Value-Guided Adaptive Ensemble Learning for Explainable Financial Fraud Detection with U.S. Regulatory Compliance Validation

Md Munna Aziz; Mohammad Nasir Uddin

arXiv: 2604.14231 · v1 · submitted 2026-04-14 · cs.LG · cs.AI · cs.NE

Pith reviewed 2026-05-10 16:23 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.NE

keywords SHAP · ensemble learning · financial fraud detection · explainable AI · regulatory compliance · AUC-ROC · adaptive weighting · IEEE-CIS dataset

The pith

SHAP attribution agreement allows dynamic weighting of ensemble models to achieve superior fraud detection performance with built-in regulatory compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates how well different machine learning models can explain their fraud predictions using SHAP values, measuring both how faithful those explanations are and how stable they remain across repeated samples. It then introduces the SHAP-Guided Adaptive Ensemble method that changes the weight given to each base model for every single transaction according to how much their explanations agree. This yields the best accuracy scores on a dataset of over half a million transactions, and the entire setup is checked against U.S. financial oversight rules that demand transparent decision processes.

Core claim

The central claim is that the SHAP-Guided Adaptive Ensemble dynamically adjusts per-transaction ensemble weights based on SHAP attribution agreement, achieving the highest AUC-ROC among all tested models on the IEEE-CIS fraud dataset while ensuring explanations meet regulatory requirements for financial institutions.

What carries the argument

The SHAP-Guided Adaptive Ensemble (SGAE), a method that uses agreement between SHAP explanations from different base models to set their combination weights on a per-transaction basis.
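The abstract does not give the formula SGAE uses to turn attribution agreement into weights, so the following is only a minimal sketch of one plausible reading. It assumes agreement is the mean cosine similarity between one model's SHAP vector and the others', and that weights come from a softmax over those scores. The names `agreement_weights`, `ensemble_score`, and the `temperature` parameter are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two attribution vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def agreement_weights(attributions, temperature=1.0):
    """Per-transaction weights from cross-model attribution agreement.

    attributions: one SHAP vector per base model, all for the same transaction.
    ASSUMPTION: agreement = mean cosine similarity to the other models,
    converted to weights with a softmax. The paper may use another rule.
    """
    m = len(attributions)
    if m == 1:
        return [1.0]
    scores = []
    for i in range(m):
        sims = [cosine(attributions[i], attributions[j]) for j in range(m) if j != i]
        scores.append(sum(sims) / len(sims))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_score(probabilities, attributions, temperature=1.0):
    """Weighted fraud probability for one transaction."""
    weights = agreement_weights(attributions, temperature)
    return sum(w * p for w, p in zip(weights, probabilities))
```

Under this reading, a model whose explanation disagrees with the others is down-weighted for that transaction only, which is what makes the weighting per-transaction rather than global.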

If this is right

Consistent SHAP attributions lead to higher weights for those base models in the ensemble for each transaction.
The SGAE method records an AUC-ROC of 0.8837 on held-out data and 0.9245 under cross-validation.
Explanations from the ensemble satisfy the transparency demands of OCC Bulletin 2011-12, Federal Reserve SR 11-7, and BSA-AML.
Standalone GNN-GraphSAGE reaches an AUC-ROC of 0.9248 and F1 score of 0.6013 on the full dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-transaction adaptation might extend naturally to other time-sensitive detection tasks such as anomaly detection in network security.
Using SHAP agreement as a weighting signal could reduce reliance on validation sets for ensemble tuning.
Future work might examine whether this agreement metric correlates with actual transaction outcomes beyond the current dataset.

Load-bearing premise

That agreement among SHAP attributions across base models provides an unbiased and non-overfitting signal for dynamically setting ensemble weights without circular dependence on the explanation method itself.

What would settle it

Comparing the AUC-ROC of SGAE to that of a non-adaptive ensemble average on the same IEEE-CIS held-out set; if the gap disappears, the adaptive weighting based on SHAP agreement would not be the driver of the gains.
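The comparison described above can be sketched as follows, assuming per-model fraud probabilities and per-transaction weights are already available for the held-out set. The helper names `auc_roc` and `auc_gap` are illustrative, and the pairwise AUC here is a quadratic-time version chosen for clarity, not something the paper specifies.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a fraud case outranks a legitimate one.

    Ties count as half a win. O(n_pos * n_neg), fine for a small check only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC-ROC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_gap(labels, model_probs, weights):
    """AUC of the adaptively weighted ensemble minus AUC of a plain average.

    model_probs[i][m]: fraud probability for transaction i from base model m.
    weights[i][m]: per-transaction weight for model m (hypothetical input).
    """
    uniform = [sum(p) / len(p) for p in model_probs]
    adaptive = [sum(w * q for w, q in zip(ws, p))
                for ws, p in zip(weights, model_probs)]
    return auc_roc(labels, adaptive) - auc_roc(labels, uniform)
```

A gap close to zero on the held-out set would suggest the SHAP-agreement weighting is not what drives the reported gain.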

Original abstract

Financial crime costs U.S. institutions over $32 billion each year. Although AI tools for fraud detection have become more advanced, their use in real-world systems still faces a major obstacle: many of these models operate as black boxes that cannot provide the transparent, auditable explanations required by regulations such as OCC Bulletin 2011-12 and Federal Reserve SR 11-7. This study makes three main contributions. First, it offers a thorough evaluation of explanation quality across faithfulness (sufficiency and comprehensiveness at k=5, 10, and 15) and stability (Kendall's W across 30 bootstrap samples). XGBoost paired with TreeExplainer achieves near-perfect stability (W=0.9912), while LSTM with DeepExplainer shows weak results (W=0.4962). Second, the paper introduces the SHAP-Guided Adaptive Ensemble (SGAE), which dynamically adjusts per-transaction ensemble weights based on SHAP attribution agreement, achieving the highest AUC-ROC among all tested models (0.8837 held-out; 0.9245 cross-validation). Third, a complete three-architecture evaluation of LSTM, Transformer, and GNN-GraphSAGE on the full 590,540-transaction IEEE-CIS dataset is provided, with GNN-GraphSAGE achieving AUC-ROC 0.9248 and F1=0.6013. All results are mapped directly to OCC, SR 11-7, and BSA-AML regulatory compliance requirements.
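The abstract names sufficiency and comprehensiveness at k but does not define them. The sketch below assumes the common erasure-style definitions: comprehensiveness is the drop in predicted score when the top-k attributed features are replaced by a baseline, and sufficiency is the drop when only the top-k features are kept. The `predict` callable, the `baseline` vector, and the function names are hypothetical.

```python
def top_k_indices(attributions, k):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attributions)), key=lambda i: -abs(attributions[i]))
    return order[:k]

def faithfulness_at_k(predict, x, attributions, baseline, k):
    """Return (comprehensiveness, sufficiency) for one transaction.

    ASSUMPTION: "removing" a feature means substituting its baseline value
    (for example a training-set mean); the paper may mask differently.
    """
    top = set(top_k_indices(attributions, k))
    n = len(x)
    without_top = [baseline[i] if i in top else x[i] for i in range(n)]
    only_top = [x[i] if i in top else baseline[i] for i in range(n)]
    full = predict(x)
    comprehensiveness = full - predict(without_top)  # higher is more faithful
    sufficiency = full - predict(only_top)           # lower is more faithful
    return comprehensiveness, sufficiency
```

Evaluating this at k=5, 10, and 15 and averaging over transactions would give one curve per model and explainer pair, which is the kind of comparison the abstract reports.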

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SGAE adapts ensemble weights via SHAP agreement on fraud data and maps results to U.S. regs, but the weighting step may depend on the same attributions used for explanation.

The paper introduces SGAE, an ensemble that sets per-transaction weights from agreement among SHAP attributions of LSTM, Transformer, and GNN-GraphSAGE models. On the 590k IEEE-CIS transactions it reports the highest held-out AUC-ROC at 0.8837 and cross-val at 0.9245, plus a full comparison of explanation faithfulness and stability across the base models. XGBoost with TreeExplainer reaches near-perfect stability (W=0.9912) while LSTM with DeepExplainer is much weaker (W=0.4962). The work also directly links every metric to OCC, SR 11-7, and BSA-AML requirements, which is a practical step for anyone who has to pass regulatory review.

Referee Report

1 major / 2 minor

Summary. The paper evaluates explanation quality (faithfulness and stability) for fraud detection models including LSTM, Transformer, GNN-GraphSAGE, and XGBoost on the IEEE-CIS 590,540-transaction dataset. It introduces the SHAP-Guided Adaptive Ensemble (SGAE) that sets per-transaction weights via agreement among base-model SHAP attributions, reports SGAE AUC-ROC of 0.8837 (held-out) and 0.9245 (CV) as the best result, and maps all findings to OCC, SR 11-7, and BSA-AML compliance requirements.

Significance. If the SGAE weighting mechanism can be shown to operate without circular dependence on the same SHAP attributions used for both weighting and explanation, the work would provide a concrete, regulation-aligned path to high-performing yet auditable ensembles for financial fraud detection. The stability comparison (e.g., XGBoost TreeExplainer W=0.9912 vs. LSTM DeepExplainer W=0.4962) supplies useful empirical guidance on explanation reliability.
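For readers who want to reproduce a stability figure of this kind, here is a minimal sketch of Kendall's W, assuming each bootstrap sample yields one vector of feature importances (for instance mean absolute SHAP values) that is converted to a ranking. It uses the standard formula W = 12S / (m²(n³ − n)) without a tie correction, which is an assumption; the function names are illustrative.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; ties are broken by index order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def kendalls_w(importances):
    """Kendall's coefficient of concordance across m rankings of n features.

    importances: m lists (one per bootstrap sample), each with n values, n >= 2.
    Returns a value in [0, 1]; 1 means every sample ranks features identically.
    """
    m = len(importances)
    n = len(importances[0])
    rankings = [rank_descending(row) for row in importances]
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((t - mean_sum) ** 2 for t in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

With 30 bootstrap samples, `importances` would hold 30 rows; a result near 0.99 indicates almost identical feature rankings across resamples, while a result near 0.5 indicates substantial reshuffling.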

major comments (1)

[Abstract / SGAE description] Abstract (SGAE paragraph): the headline claim that SGAE achieves the highest held-out AUC-ROC (0.8837) rests on per-transaction weights derived from SHAP attribution agreement across LSTM, Transformer, and GNN-GraphSAGE. The text provides no explicit statement that this agreement signal is computed on a validation fold strictly separate from the held-out test set used for the AUC metric. Because SHAP values are extracted directly from each base model and the reported stability for LSTM+DeepExplainer is low (W=0.4962), any bias or instability in the explanations can directly propagate into the ensemble weights and therefore into the performance number, creating a potential circularity that must be ruled out before the central performance claim can be accepted.

minor comments (2)

[Abstract] Abstract: the reported metrics (AUC 0.8837 held-out, 0.9245 CV, F1=0.6013 for GNN) are given without train/test split ratios, hyperparameter search protocol, or any statistical significance test, making it impossible to assess whether the gains are robust.
[Regulatory compliance discussion] The regulatory mapping section would benefit from a concise table that explicitly links each reported metric (faithfulness at k=5/10/15, stability W, AUC) to the specific requirements in OCC 2011-12 and SR 11-7.
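One way to supply the significance check requested in the first minor comment is a paired bootstrap over the held-out transactions. The sketch below is an illustration of that idea only; the paper reports no such test, and `paired_bootstrap_auc_diff`, `n_boot`, and `alpha` are hypothetical names and defaults.

```python
import random

def auc_roc(labels, scores):
    """Pairwise AUC-ROC with ties counted as half; quadratic time."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC-ROC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auc_diff(labels, scores_a, scores_b,
                              n_boot=200, seed=0, alpha=0.05):
    """Percentile interval for AUC(a) - AUC(b) on the same transactions."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one class entirely
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc_roc(ys, a) - auc_roc(ys, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

If the resulting interval for SGAE against the strongest baseline excludes zero, the reported gain is less likely to be a resampling artifact.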

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on manuscript arXiv:2604.14231. The identification of an ambiguity in the SGAE description is helpful and will improve the clarity of the work. We respond point-by-point to the major comment below.

Referee: Abstract (SGAE paragraph): the headline claim that SGAE achieves the highest held-out AUC-ROC (0.8837) rests on per-transaction weights derived from SHAP attribution agreement across LSTM, Transformer, and GNN-GraphSAGE. The text provides no explicit statement that this agreement signal is computed on a validation fold strictly separate from the held-out test set used for the AUC metric. Because SHAP values are extracted directly from each base model and the reported stability for LSTM+DeepExplainer is low (W=0.4962), any bias or instability in the explanations can directly propagate into the ensemble weights and therefore into the performance number, creating a potential circularity that must be ruled out before the central performance claim can be accepted.

Authors: We agree that the manuscript does not explicitly state the data partitioning used to compute the SHAP agreement signal for SGAE weights. This omission creates the ambiguity noted. In the revised version we will add a clear statement in the abstract and a new paragraph in Section 3 (Methods) specifying the protocol: base models are trained exclusively on the training fold; SHAP attributions for weight computation are generated on a distinct validation fold; and ensemble predictions plus the reported AUC-ROC (0.8837) are obtained on a strictly held-out test set never seen during weight determination. This separation eliminates test-set leakage into the weighting step. Regarding propagation of instability, the per-transaction weights are derived from cross-model agreement rather than any single model’s SHAP values; the high stability of XGBoost (W=0.9912) therefore anchors the ensemble. We will also insert a short sensitivity analysis showing that down-weighting the less-stable LSTM does not materially alter the final AUC. These changes directly address the circularity concern while preserving the regulatory mapping to SR 11-7 and OCC requirements.

revision: yes
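The partitioning promised in the response can be sketched as a three-way index split with an explicit leakage check. The 60/20/20 fractions and the function names are assumptions made for illustration; the source gives no split ratios, and a time-ordered split may be more appropriate for transaction data.

```python
import random

def three_way_split(n_rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train, validation, and test.

    ASSUMPTION: fractions are placeholders, not values from the paper.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

def check_no_leakage(train, val, test):
    """Raise if any row appears in more than one fold."""
    a, b, c = set(train), set(val), set(test)
    if a & b or a & c or b & c:
        raise ValueError("folds overlap: leakage between partitions")
    return True
```

In this arrangement base models would be fit on `train`, anything tuned for the weighting rule would use `val`, and AUC-ROC would be computed once on `test`.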

Circularity Check

0 steps flagged

No significant circularity in SGAE derivation chain

The paper's central claim rests on training base models (LSTM, Transformer, GNN-GraphSAGE), computing SHAP attributions on them, deriving per-transaction weights from attribution agreement, and then evaluating the resulting ensemble on held-out AUC-ROC (0.8837) and cross-validation (0.9245). No equations or procedural descriptions in the provided abstract reduce the performance metric to the SHAP agreement signal by construction; the held-out evaluation remains an independent benchmark. No self-citations, uniqueness theorems, or ansatzes are invoked to force the result. The method is self-contained against external data splits and regulatory mapping, with no load-bearing step that collapses to a fitted input renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; no explicit derivations or model equations are shown.

pith-pipeline@v0.9.0 · 5590 in / 1130 out tokens · 38506 ms · 2026-05-10T16:23:17.577857+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCAFDS: Edge-Feature Graph Attention for Interbank Fraud Detection with Attribution-Grounded SAR Generation
cs.CR 2026-05 unverdicted novelty 7.0

SCAFDS applies edge-feature graph attention on fraud co-occurrence metrics to detect interbank fraud and generate attribution-grounded SAR reports, reporting AUPRC 0.515 and AUROC 0.802 on IEEE-CIS data with gains ove...

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper

[1] M. Shafii et al., 'Explainable AI for fraud detection: An attention-based ensemble of CNNs, GNNs, and a confidence-driven gating mechanism,' arXiv:2410.09069, 2025. [9] Y. Cheng, X. Zhou, J. Wang, and Y. Zhang, 'A comprehensive review of graph neural networks for fraud detection,' Frontiers Comput. Sci., vol. 19, no. 1, pp. 143–162, 2025. [10] T. Deng, S....

work page doi:10.1007/s10462-026-11516-7 2025

[2] P. Thanathamathee et al., 'SHAP-instance weighting for imbalanced fraud detection,' Emerging Science Journal, vol. 8, no. 3, 2024. [31] A. Awasthi, 'Post-hoc explainability and regulatory compliance risk in AI-driven financial decisions,' Financial Innovation, 2025. [32] A. Miró-Nicolau, G. Moyà-Alcover, A. Jaume-i-Capó, M. González-Hidalgo, and P. Bibilo...

work page 2024
