arxiv: 2602.18895 · v2 · pith:QMKIA63Rnew · submitted 2026-02-21 · 💱 q-fin.RM · cs.LG

Could Large Language Models work as Post-hoc Explainability Tools in Credit Risk Models?

Pith reviewed 2026-05-21 11:51 UTC · model grok-4.3

classification 💱 q-fin.RM cs.LG
keywords large language models · post-hoc explainability · credit risk · SHAP · feature importance · LendingClub · autonomous explanations · model governance

The pith

Large language models reliably reproduce credit risk feature rankings only when given controlled prompts that enforce reference attributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can act as post-hoc explainability tools for credit risk models by checking whether they preserve the feature-importance order given by SHAP values or regression coefficients, and whether they can produce aligned explanations without extra guidance. Experiments on LendingClub loan data compare outputs from GPT-4-turbo, Claude-Sonnet-4.5, and Gemini-2.5-Flash against standard attribution methods. A reader would care because banks and regulators need clear reasons for credit decisions, and LLMs might turn technical model outputs into readable text. The results show strong agreement only when prompts force the model to follow existing rankings; agreement weakens when the LLM generates explanations freely.

Core claim

Using a LendingClub dataset and comparing outputs from three major LLMs to SHAP and coefficient-based attributions, the study finds that LLMs reliably reproduce reference rankings under controlled prompts but show limited alignment when generating explanations autonomously. These findings suggest that LLMs are best deployed as narrative interfaces rather than substitutes for formal attribution methods in credit risk governance.

What carries the argument

The split between controlled prompts that instruct LLMs to follow given reference rankings and autonomous generation where LLMs produce explanations without such constraints, serving as a test of fidelity to model attributions.

Load-bearing premise

That preserving feature-importance rankings from SHAP or coefficients serves as a valid and sufficient proxy for the quality of LLM-generated explanations in credit risk settings.

What would settle it

A blind rating by credit risk experts showing that autonomous LLM explanations align better with actual model behavior or human judgment than controlled-prompt versions on a new set of loan applications would falsify the central distinction.

Original abstract

Large language models (LLMs) have shown promise in translating model-based explanations into human-readable narratives. This study evaluates whether LLMs can serve as post-hoc explainability interfaces for credit risk models, focusing on their ability to preserve feature-importance rankings and generate autonomous explanations. Using a LendingClub dataset, we compare LLM outputs with SHAP and coefficient-based attributions on three major LLMs, including GPT-4-turbo, Claude-Sonnet-4.5, and Gemini-2.5-Flash. Results indicate that LLMs reliably reproduce reference rankings under controlled prompts but show limited alignment when generating explanations autonomously. These findings suggest that LLMs are best deployed as narrative interfaces rather than substitutes for formal attribution methods in credit risk governance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates whether large language models can serve as post-hoc explainability tools for credit risk models. It tests three LLMs (GPT-4-turbo, Claude-Sonnet-4.5, Gemini-2.5-Flash) on a LendingClub dataset by comparing their outputs to SHAP values and logistic regression coefficients. The central findings are that LLMs reliably reproduce reference feature-importance rankings when given controlled prompts but exhibit limited alignment when generating explanations autonomously; the authors therefore recommend LLMs as narrative interfaces rather than substitutes for formal attribution methods.

Significance. If the empirical results hold under more rigorous evaluation, the work would provide timely practical guidance on deploying LLMs in regulated credit-risk settings where explainability is required. The multi-model comparison and distinction between controlled versus autonomous prompting are useful contributions. However, the reliance on ranking preservation as the primary validation metric limits the strength of the conclusions for regulatory or governance applications.

major comments (2)
  1. [Results] Results section: The abstract and main findings state that LLMs 'reliably reproduce reference rankings' and show 'limited alignment' in autonomous mode, yet no quantitative metrics (e.g., Kendall tau, top-k overlap percentages, or average rank correlation), statistical tests, sample sizes, or data-split details are reported. This absence leaves the directional claims only modestly supported.
  2. [Methodology / Results] Evaluation framework (implicit in Methodology and Results): Preservation of global feature-importance rankings is treated as the primary positive evidence for LLM utility. In the presence of correlated features typical of LendingClub data, this metric can be satisfied while local explanations remain unfaithful to the model's decision surface or introduce hallucinated causal claims; the paper does not test or discuss this distinction, which is load-bearing for the central claim that LLMs can function as post-hoc explainability tools.
minor comments (2)
  1. [Prompt Design] The exact prompt templates for the controlled and autonomous conditions are not reproduced; providing them (or an appendix) would improve reproducibility.
  2. [Experimental Setup] Model versions and access dates (e.g., exact GPT-4-turbo checkpoint) should be stated explicitly to allow future replication.
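
The rank-agreement metrics named in major comment 1 can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the feature names and both rankings are hypothetical, the Kendall tau shown is the tau-a variant for rankings without ties, and `kendall_tau` and `top_k_overlap` are helper names introduced here.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two tie-free rankings of the same features."""
    if len(set(rank_a)) != len(rank_a) or set(rank_a) != set(rank_b):
        raise ValueError("rankings must order the same features without repeats")
    pos_a = {f: i for i, f in enumerate(rank_a)}
    pos_b = {f: i for i, f in enumerate(rank_b)}
    concordant = discordant = 0
    for f, g in combinations(rank_a, 2):
        # A pair is concordant when both rankings order it the same way.
        same_order = (pos_a[f] - pos_a[g]) * (pos_b[f] - pos_b[g]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of the top-k features shared by both rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


# Hypothetical rankings, most important feature first (illustrative only).
reference = ["int_rate", "dti", "annual_inc", "fico", "loan_amnt"]  # e.g. from SHAP
llm = ["int_rate", "annual_inc", "dti", "loan_amnt", "fico"]        # e.g. from an LLM

tau = kendall_tau(reference, llm)
overlap_3 = top_k_overlap(reference, llm, 3)
overlap_4 = top_k_overlap(reference, llm, 4)
print(tau, overlap_3, overlap_4)
```

In this toy case the two rankings share the same top three features but disagree on two pairwise orderings, which is why a top-k overlap and a rank correlation are worth reporting together.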

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects for improving the rigor of our evaluation. We address each major comment below, indicating the revisions we plan to make to the manuscript.

  1. Referee: [Results] Results section: The abstract and main findings state that LLMs 'reliably reproduce reference rankings' and show 'limited alignment' in autonomous mode, yet no quantitative metrics (e.g., Kendall tau, top-k overlap percentages, or average rank correlation), statistical tests, sample sizes, or data-split details are reported. This absence leaves the directional claims only modestly supported.

    Authors: We agree that the absence of quantitative metrics weakens the support for our claims. In the revised manuscript, we will add Kendall's tau rank correlations, top-k overlap percentages, average rank correlations, and report the sample sizes and data-split details used in our experiments. Where applicable, we will include statistical tests to assess the significance of the observed alignments. revision: yes

  2. Referee: [Methodology / Results] Evaluation framework (implicit in Methodology and Results): Preservation of global feature-importance rankings is treated as the primary positive evidence for LLM utility. In the presence of correlated features typical of LendingClub data, this metric can be satisfied while local explanations remain unfaithful to the model's decision surface or introduce hallucinated causal claims; the paper does not test or discuss this distinction, which is load-bearing for the central claim that LLMs can function as post-hoc explainability tools.

    Authors: This is a valid concern. Global ranking preservation does not necessarily imply faithful local explanations, especially given the multicollinearity present in credit risk datasets. Our primary focus was on the ability of LLMs to serve as narrative interfaces by reproducing reference rankings under controlled conditions. We did not claim that LLMs provide faithful local attributions. In the revision, we will add a dedicated subsection in the Discussion to address this limitation, explicitly noting the risks of correlated features, potential hallucinations in causal language, and the distinction between global and local fidelity. We will also reinforce that our recommendation positions LLMs as complementary narrative tools rather than replacements for established attribution methods. revision: partial
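
The global-versus-local distinction raised in the second exchange can be made concrete with a toy linear model. The sketch below is an assumption-laden illustration rather than anything from the paper: the weights and applicant rows are invented, the features are treated as standardized and independent, and it relies on the standard result that for a linear model with independent features the SHAP value of feature i is w_i (x_i - E[x_i]). It does not model feature correlation itself; it only shows that a global ranking can be preserved while an individual applicant's ranking differs.

```python
def linear_shap(weights, means, x):
    """SHAP values of a linear model, assuming independent features."""
    return {f: weights[f] * (x[f] - means[f]) for f in weights}


def rank_by_magnitude(attributions):
    """Feature names ordered by absolute attribution, largest first."""
    return sorted(attributions, key=lambda f: abs(attributions[f]), reverse=True)


def global_importance(weights, means, rows):
    """Mean absolute SHAP value per feature across all rows."""
    totals = {f: 0.0 for f in weights}
    for x in rows:
        for f, phi in linear_shap(weights, means, x).items():
            totals[f] += abs(phi)
    return {f: total / len(rows) for f, total in totals.items()}


# Hypothetical model weights and standardized applicant rows (illustrative only).
weights = {"int_rate": 0.8, "dti": 0.5, "annual_inc": -0.3}
rows = [
    {"int_rate": 1.0, "dti": 0.5, "annual_inc": -0.2},
    {"int_rate": -1.0, "dti": -0.5, "annual_inc": 0.2},
    {"int_rate": 0.1, "dti": 0.2, "annual_inc": 2.0},
    {"int_rate": -0.1, "dti": -0.2, "annual_inc": -2.0},
]
means = {f: sum(x[f] for x in rows) / len(rows) for f in weights}

global_rank = rank_by_magnitude(global_importance(weights, means, rows))
local_rank = rank_by_magnitude(linear_shap(weights, means, rows[2]))
print("global:", global_rank)
print("local (third applicant):", local_rank)
```

Here the global ranking places `int_rate` first, while for the third applicant `annual_inc` dominates and `int_rate` comes last. An explanation that faithfully repeats the global order would therefore misdescribe that applicant's decision.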

Circularity Check

0 steps flagged

No significant circularity in this empirical evaluation

The paper conducts a direct empirical comparison of LLM-generated feature rankings and explanations against SHAP values and logistic regression coefficients on LendingClub data, with no mathematical derivations, fitted parameters presented as independent predictions, self-referential definitions, or load-bearing self-citations that reduce claims to their own inputs. Results are evaluated against external reference attributions rather than constructed from the study's own outputs, rendering the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that SHAP and coefficient-based attributions constitute reliable ground truth for feature importance. No free parameters or invented entities are introduced; the work is an empirical comparison rather than a derivation.

axioms (1)
  • domain assumption: SHAP values and model coefficients provide faithful reference attributions against which LLM outputs can be benchmarked.
    The evaluation framework treats these methods as the standard for comparison.

pith-pipeline@v0.9.0 · 5659 in / 1101 out tokens · 34070 ms · 2026-05-21T11:51:30.373778+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Signal or Noise in Multi-Agent LLM-based Stock Recommendations?

    q-fin.PM 2026-04 unverdicted novelty 6.0

    A multi-agent LLM equity system produces statistically significant outperformance on S&P 500 stocks, with strong-buy portfolios returning +2.18% monthly versus +1.15% for the equal-weight benchmark over 19 months.