Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?

Dingyuan Liu; Liya Li; Wenxi Geng; Yiqing Wang

arxiv: 2602.18895 · v2 · pith:QMKIA63Rnew · submitted 2026-02-21 · 💱 q-fin.RM · cs.LG

Pith reviewed 2026-05-21 11:51 UTC · model grok-4.3

keywords: large language models, post-hoc explainability, credit risk, SHAP, feature importance, LendingClub, autonomous explanations, model governance

The pith

Large language models reliably reproduce credit risk feature rankings only when given controlled prompts that enforce reference attributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can act as post-hoc explainability tools for credit risk models by checking if they keep the same feature importance order as SHAP values or regression coefficients and whether they can produce good explanations without extra guidance. Experiments on LendingClub loan data compare outputs from GPT-4-turbo, Claude-Sonnet-4.5, and Gemini-2.5-Flash against standard attribution methods. A reader would care because banks and regulators need clear reasons for credit decisions, and LLMs might turn technical model outputs into readable text. The results show strong performance only when prompts force the model to follow existing rankings, but weaker results when the LLM generates explanations freely.

Core claim

Using a LendingClub dataset and comparing outputs from three major LLMs to SHAP and coefficient-based attributions, the study finds that LLMs reliably reproduce reference rankings under controlled prompts but show limited alignment when generating explanations autonomously. These findings suggest that LLMs are best deployed as narrative interfaces rather than substitutes for formal attribution methods in credit risk governance.

What carries the argument

The split between controlled prompts, which instruct LLMs to follow given reference rankings, and autonomous generation, where LLMs produce explanations without such constraints. This split serves as the test of fidelity to model attributions.
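To make the controlled versus autonomous split concrete, the sketch below builds one prompt of each kind. The templates, function names, and feature names are illustrative assumptions; the paper's actual prompt templates are not reproduced in the material reviewed here.

```python
# Hypothetical prompt builders: NOT the paper's templates, only an
# illustration of the two prompting conditions described above.

def build_controlled_prompt(features, reference_ranking):
    """Prompt that supplies a reference ranking and asks the LLM to follow it."""
    ranked = "\n".join(
        f"{i}. {name}" for i, name in enumerate(reference_ranking, start=1)
    )
    return (
        "You are explaining a credit risk model.\n"
        f"Model features: {', '.join(features)}\n"
        "Reference feature-importance ranking (most to least important):\n"
        f"{ranked}\n"
        "Explain the model in plain language and keep this exact order."
    )


def build_autonomous_prompt(features):
    """Prompt that gives no ranking; the LLM must decide the order itself."""
    return (
        "You are explaining a credit risk model.\n"
        f"Model features: {', '.join(features)}\n"
        "Rank the features from most to least important and explain why."
    )


# Illustrative feature names (assumed for this example).
features = ["annual_inc", "dti", "int_rate", "loan_amnt"]
reference = ["int_rate", "dti", "annual_inc", "loan_amnt"]

controlled = build_controlled_prompt(features, reference)
autonomous = build_autonomous_prompt(features)
print(controlled)
print("---")
print(autonomous)
```

The only difference between the two conditions in this sketch is whether the reference ranking is included in the prompt, which is the variable the study's central distinction depends on.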

Load-bearing premise

That preserving feature-importance rankings from SHAP or coefficients serves as a valid and sufficient proxy for the quality of LLM-generated explanations in credit risk settings.

What would settle it

A blind rating by credit risk experts showing that autonomous LLM explanations align better with actual model behavior or human judgment than controlled-prompt versions on a new set of loan applications would falsify the central distinction.

Original abstract

Large language models (LLMs) have shown promise in translating model-based explanations into human-readable narratives. This study evaluates whether LLMs can serve as post-hoc explainability interfaces for credit risk models, focusing on their ability to preserve feature-importance rankings and generate autonomous explanations. Using a LendingClub dataset, we compare LLM outputs with SHAP and coefficient-based attributions on three major LLMs, including GPT-4-turbo, Claude-Sonnet-4.5, and Gemini-2.5-Flash. Results indicate that LLMs reliably reproduce reference rankings under controlled prompts but show limited alignment when generating explanations autonomously. These findings suggest that LLMs are best deployed as narrative interfaces rather than substitutes for formal attribution methods in credit risk governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Controlled prompts let LLMs match SHAP rankings on this credit dataset while free-form outputs drift, but ranking match is a thin proxy for whether the narratives are actually faithful or useful.

The key point is that controlled prompts let LLMs reproduce feature importance rankings from SHAP and logistic coefficients on LendingClub data, while autonomous generation produces weaker alignment. This is shown across GPT-4-turbo, Claude, and Gemini. What stands out is the direct comparison of prompt styles in a real credit risk setting. The authors run the same models on the same data and highlight how much the output depends on how the prompt is written. That practical observation is worth noting for teams trying to use LLMs as explanation layers. The softer part is the choice of success measure. Matching an ordered list of features does not confirm that the narrative explanation is faithful to the underlying model or free of invented reasons, especially when input variables like income and debt are correlated. The abstract gives directional results without numbers on agreement rates, variance across runs, or checks against human judgments, so it is difficult to gauge how reliable the positive finding really is. Readers who work on model governance in banking or fintech will find the most value, as it flags the need for careful prompt design rather than treating LLMs as drop-in explainers. The work engages honestly with the limits of current LLMs and does not overclaim. I would send this to peer review. The core experiment is straightforward and the domain is relevant, even if the evaluation could be tightened with more quantitative checks and discussion of regulatory standards.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates whether large language models can serve as post-hoc explainability tools for credit risk models. It tests three LLMs (GPT-4-turbo, Claude-Sonnet-4.5, Gemini-2.5-Flash) on a LendingClub dataset by comparing their outputs to SHAP values and logistic regression coefficients. The central findings are that LLMs reliably reproduce reference feature-importance rankings when given controlled prompts but exhibit limited alignment when generating explanations autonomously; the authors therefore recommend LLMs as narrative interfaces rather than substitutes for formal attribution methods.

Significance. If the empirical results hold under more rigorous evaluation, the work would provide timely practical guidance on deploying LLMs in regulated credit-risk settings where explainability is required. The multi-model comparison and distinction between controlled versus autonomous prompting are useful contributions. However, the reliance on ranking preservation as the primary validation metric limits the strength of the conclusions for regulatory or governance applications.

major comments (2)

[Results] Results section: The abstract and main findings state that LLMs 'reliably reproduce reference rankings' and show 'limited alignment' in autonomous mode, yet no quantitative metrics (e.g., Kendall tau, top-k overlap percentages, or average rank correlation), statistical tests, sample sizes, or data-split details are reported. This absence leaves the directional claims only modestly supported.
[Methodology / Results] Evaluation framework (implicit in Methodology and Results): Preservation of global feature-importance rankings is treated as the primary positive evidence for LLM utility. In the presence of correlated features typical of LendingClub data, this metric can be satisfied while local explanations remain unfaithful to the model's decision surface or introduce hallucinated causal claims; the paper does not test or discuss this distinction, which is load-bearing for the central claim that LLMs can function as post-hoc explainability tools.
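The agreement metrics requested in the first major comment are simple to compute. The following is a minimal sketch, assuming each LLM output has already been parsed into an ordered list of feature names with no ties. The rankings are made up for illustration and are not results from the paper.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    if set(ranking_a) != set(ranking_b) or len(ranking_a) < 2:
        raise ValueError("rankings must contain the same two or more items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k items that the two rankings share."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Made-up example rankings (assumptions, not data from the study).
reference = ["int_rate", "dti", "annual_inc", "loan_amnt", "emp_length"]
controlled_out = list(reference)
autonomous_out = ["annual_inc", "int_rate", "loan_amnt", "dti", "emp_length"]

print(kendall_tau(reference, controlled_out))       # 1.0, identical order
print(kendall_tau(reference, autonomous_out))       # 0.4, 7 concordant vs 3 discordant pairs
print(top_k_overlap(reference, autonomous_out, 3))  # 2 of the top 3 are shared
```

Reporting these values per model and per prompt condition, along with their variance across repeated runs, would show how strong the "reliably reproduce" claim actually is.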

minor comments (2)

[Prompt Design] The exact prompt templates for the controlled and autonomous conditions are not reproduced; providing them (or an appendix) would improve reproducibility.
[Experimental Setup] Model versions and access dates (e.g., exact GPT-4-turbo checkpoint) should be stated explicitly to allow future replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects for improving the rigor of our evaluation. We address each major comment below, indicating the revisions we plan to make to the manuscript.

Referee: [Results] Results section: The abstract and main findings state that LLMs 'reliably reproduce reference rankings' and show 'limited alignment' in autonomous mode, yet no quantitative metrics (e.g., Kendall tau, top-k overlap percentages, or average rank correlation), statistical tests, sample sizes, or data-split details are reported. This absence leaves the directional claims only modestly supported.

Authors: We agree that the absence of quantitative metrics weakens the support for our claims. In the revised manuscript, we will add Kendall's tau rank correlations, top-k overlap percentages, average rank correlations, and report the sample sizes and data-split details used in our experiments. Where applicable, we will include statistical tests to assess the significance of the observed alignments.

Revision: yes

Referee: [Methodology / Results] Evaluation framework (implicit in Methodology and Results): Preservation of global feature-importance rankings is treated as the primary positive evidence for LLM utility. In the presence of correlated features typical of LendingClub data, this metric can be satisfied while local explanations remain unfaithful to the model's decision surface or introduce hallucinated causal claims; the paper does not test or discuss this distinction, which is load-bearing for the central claim that LLMs can function as post-hoc explainability tools.

Authors: This is a valid concern. Global ranking preservation does not necessarily imply faithful local explanations, especially given the multicollinearity present in credit risk datasets. Our primary focus was on the ability of LLMs to serve as narrative interfaces by reproducing reference rankings under controlled conditions. We did not claim that LLMs provide faithful local attributions. In the revision, we will add a dedicated subsection in the Discussion to address this limitation, explicitly noting the risks of correlated features, potential hallucinations in causal language, and the distinction between global and local fidelity. We will also reinforce that our recommendation positions LLMs as complementary narrative tools rather than replacements for established attribution methods.

Revision: partial
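The gap between global and local fidelity can be shown with a small worked example. The sketch below assumes a linear model on the log-odds scale and uses the standard closed form for linear SHAP under feature independence, where the attribution of feature j for one applicant is w_j * (x_j - mean_j). The weights and applicant values are invented for illustration and do not come from the paper.

```python
from statistics import fmean


def linear_shap(weights, row, means):
    """Per-feature attribution for one row of a linear model,
    assuming independent features: w_j * (x_j - mean_j)."""
    return {f: w * (row[f] - means[f]) for f, w in weights.items()}


def rank_by_magnitude(scores):
    """Order features by absolute attribution, largest first."""
    return sorted(scores, key=lambda f: abs(scores[f]), reverse=True)


# Made-up log-odds weights on standardized features (assumption).
weights = {"int_rate": 0.8, "dti": 0.5, "annual_inc": -0.3}

# Three made-up applicants; each column averages to zero.
applicants = [
    {"int_rate": 1.0, "dti": 0.5, "annual_inc": 0.5},
    {"int_rate": -1.5, "dti": -0.25, "annual_inc": 1.5},
    {"int_rate": 0.5, "dti": -0.25, "annual_inc": -2.0},
]

means = {f: fmean(a[f] for a in applicants) for f in weights}
local = [linear_shap(weights, a, means) for a in applicants]
local_ranks = [rank_by_magnitude(s) for s in local]

# Global importance: mean absolute attribution across applicants.
global_scores = {f: fmean(abs(s[f]) for s in local) for f in weights}
global_rank = rank_by_magnitude(global_scores)

print("global:", global_rank)
for i, ranks in enumerate(local_ranks):
    print(f"applicant {i}:", ranks)
```

In this toy data the global order puts int_rate first, yet for the third applicant annual_inc dominates. A narrative that repeats the global ranking would therefore misdescribe that individual decision even though it matches the reference ranking perfectly.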

Circularity Check

0 steps flagged

No significant circularity in this empirical evaluation

The paper conducts a direct empirical comparison of LLM-generated feature rankings and explanations against SHAP values and logistic regression coefficients on LendingClub data, with no mathematical derivations, fitted parameters presented as independent predictions, self-referential definitions, or load-bearing self-citations that reduce claims to their own inputs. Results are evaluated against external reference attributions rather than constructed from the study's own outputs, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that SHAP and coefficient-based attributions constitute reliable ground truth for feature importance. No free parameters or invented entities are introduced; the work is an empirical comparison rather than a derivation.

axioms (1)

domain assumption SHAP values and model coefficients provide faithful reference attributions against which LLM outputs can be benchmarked.
The evaluation framework treats these methods as the standard for comparison.
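How much weight this axiom can bear depends partly on how the coefficient-based reference is constructed. As a hedged illustration, the sketch below ranks features two ways: by raw absolute coefficient, and by absolute coefficient scaled by the feature's standard deviation. The coefficients and data are invented, and the scaling choice is an assumption made here, not a method confirmed by the paper.

```python
from statistics import pstdev


def raw_coefficient_ranking(coefs):
    """Rank features by |coefficient| alone (sensitive to feature units)."""
    return sorted(coefs, key=lambda f: abs(coefs[f]), reverse=True)


def scaled_coefficient_ranking(coefs, columns):
    """Rank features by |coefficient| * population std of the feature,
    which puts features measured in different units on one scale."""
    scores = {f: abs(coefs[f]) * pstdev(columns[f]) for f in coefs}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up logistic-regression coefficients in raw units (assumption).
coefs = {"int_rate": 0.10, "dti": 0.03, "annual_inc": -0.00002}

# Made-up feature columns in raw units (percent, ratio points, dollars).
columns = {
    "int_rate": [8, 12, 16, 20],
    "dti": [10, 20, 30, 40],
    "annual_inc": [30000, 60000, 90000, 120000],
}

raw_rank = raw_coefficient_ranking(coefs)
scaled_rank = scaled_coefficient_ranking(coefs, columns)
print("raw:   ", raw_rank)
print("scaled:", scaled_rank)
```

Here annual_inc moves from last to first once units are accounted for, so an LLM scored against one version of the reference could look well aligned or poorly aligned depending on a preprocessing choice.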

pith-pipeline@v0.9.0 · 5659 in / 1101 out tokens · 34070 ms · 2026-05-21T11:51:30.373778+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Results indicate that LLMs reliably reproduce reference rankings under controlled prompts but show limited alignment when generating explanations autonomously."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We compare LLM outputs with SHAP and coefficient-based attributions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Signal or Noise in Multi-Agent LLM-based Stock Recommendations?
q-fin.PM 2026-04 unverdicted novelty 6.0

A multi-agent LLM equity system produces statistically significant outperformance on S&P 500 stocks, with strong-buy portfolios returning +2.18% monthly versus +1.15% for the equal-weight benchmark over 19 months.