pith. machine review for the scientific record.

arxiv: 2605.08671 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords explanation fairness · LLM disparities · demographic bias · decision explanations · hedging density score · fairness taxonomy · prompting mitigations · AI auditing

The pith

Large language models produce explanations of varying quality and style for identical decisions depending on the demographic groups mentioned.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Explanation Fairness Taxonomy to evaluate whether LLMs justify decisions with equal depth, tone, and sophistication across demographic groups. It tests this across four decision domains, five models, and up to 400 prompt pairs built from 80 templates, using two new metrics for hedging density and explanation faithfulness. All measured dimensions show statistically significant disparities, with the size of the gaps varying sharply by model. Prompting interventions reduce some gaps but leave stylistic differences intact, consistent with the idea that these patterns are baked into training data rather than correctable at deployment time.

Core claim

Across up to 400 prompt pairs in hiring, medical, credit, and legal domains, every metric in the five-dimensional Explanation Fairness Taxonomy exhibits statistically significant disparities, with Cohen's d values from small to large and p_BH below 10^(-62); model choice strongly modulates disparity magnitude, and two prompting mitigations cut decision-linked disparities by 78-95 percent while leaving stylistic dimensions unchanged.
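
For readers who want the statistics made concrete, the sketch below shows one way the two reported quantities could be computed from per-group metric values. It is a minimal sketch, not the authors' code: the pooled-SD form of Cohen's d, the manual Benjamini-Hochberg step-up, and all variable names are assumptions.

```python
# Minimal sketch: effect size plus multiple-comparison correction.
# Assumes `a` and `b` are 1-D NumPy arrays of a metric (e.g. explanation
# length) for the two demographic conditions of each prompt pair.
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with pooled standard deviation (Cohen, 1988)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def benjamini_hochberg(pvals: np.ndarray) -> np.ndarray:
    """BH-adjusted p-values (step-up FDR control; Benjamini & Hochberg, 1995)."""
    m = len(pvals)
    order = np.argsort(pvals)            # indices sorted by ascending p
    adjusted = np.empty(m)
    running_min = 1.0
    for rank in range(m, 0, -1):         # walk from the largest p downward
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min        # enforce monotone adjusted values
    return adjusted
```

On this reading, a disparity counts as significant when its BH-adjusted p-value falls below the chosen alpha, which is how "all p_BH below 10^(-62)" should be read.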

What carries the argument

The Explanation Fairness Taxonomy (EFT) with its five dimensions—Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity—operationalized via the Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP) black-box metrics.
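
The HDS formula itself is not reproduced on this page, so the following is a hypothetical illustration only: hedging markers counted against a small lexicon and normalized per 100 tokens. Both the lexicon and the normalization are assumptions about how such a black-box metric could work.

```python
# Hypothetical Hedging Density Score: marker hits per 100 tokens.
# The marker set below is an assumed sample of standard epistemic-hedging
# cues, not the paper's actual lexicon.
import re

HEDGE_MARKERS = {
    "may", "might", "could", "possibly", "perhaps", "likely",
    "suggests", "appears", "seems", "somewhat", "arguably",
}

def hedging_density_score(explanation: str) -> float:
    """Return hedging markers per 100 tokens of the explanation."""
    tokens = re.findall(r"[a-z']+", explanation.lower())
    if not tokens:
        return 0.0
    hedges = sum(1 for tok in tokens if tok in HEDGE_MARKERS)
    return 100.0 * hedges / len(tokens)
```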

If this is right

  • Model choice can be used to reduce or increase the size of explanation disparities, with some models showing gaps several times larger than others.
  • Prompting-based interventions can substantially shrink decision-linked explanation disparities.
  • Stylistic disparities in verbosity, hedging, and lexical complexity remain after prompting, supporting the view that they originate in pre-training distributions.
  • A reproducible measurement framework enables systematic auditing of explanation fairness before deployment.
  • The results carry direct implications for regulatory requirements around AI decision explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If users notice and react to these explanation differences, trust in AI-assisted decisions may vary systematically by demographic context even when the underlying decision is unchanged.
  • Because prompting leaves stylistic gaps untouched, lasting fixes may require modifications to training data or model objectives rather than post-training instructions.
  • The same measurement approach could be applied to other generative outputs, such as summaries or recommendations, to check for similar demographic-linked quality gaps.

Load-bearing premise

The selected prompt templates and black-box metrics isolate explanation differences caused by demographic mentions rather than other prompt wording or model artifacts.

What would settle it

Re-running the full set of experiments with prompt templates that hold demographic terms constant while varying other elements, or with alternative explanation-quality metrics, and finding no statistically significant disparities would undermine the central claim.
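
Concretely, that re-run presupposes matched prompt pairs that hold everything constant except the demographic mention. A minimal sketch of that design follows; the template text, fields, and example values are invented for illustration.

```python
# Matched counterfactual prompt pairs: identical up to the demographic term.
TEMPLATE = (
    "The applicant, a {demographic} software engineer with {years} years "
    "of experience, was {decision} for the role. Explain the decision."
)

def make_prompt_pair(group_a: str, group_b: str, years: int, decision: str):
    """Return two prompts that differ only in the demographic mention."""
    shared = dict(years=years, decision=decision)
    return (TEMPLATE.format(demographic=group_a, **shared),
            TEMPLATE.format(demographic=group_b, **shared))

pair = make_prompt_pair("55-year-old", "25-year-old", years=8, decision="rejected")
```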

Figures

Figures reproduced from arXiv: 2605.08671 by Gautam Veldanda.

Figure 1. Effect Sizes for Explanation Fairness Disparities (RQ1)
Figure 3. Mean Disparity by Model × Domain (RQ3)
Original abstract

Large language models (LLMs) are increasingly deployed not only to make decisions but to explain them. While AI decision fairness has been studied extensively, the fairness of AI explanations (whether LLMs justify decisions with equal quality, depth, tone, and linguistic sophistication across demographic groups) has received little attention. This paper introduces the Explanation Fairness Taxonomy (EFT), a framework comprising five formally defined, operationalizable dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The taxonomy is instantiated in a controlled empirical study across 80 prompt templates, four consequential decision domains (hiring, medical triage, credit assessment, legal judgment), and five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B. Two novel black-box metrics are introduced: the Hedging Density Score (HDS) and the Explanation Faithfulness Proxy (EFP), a heuristic indicator of decision-linked explanation variation. Across up to 400 prompt pairs, all eight EFT metrics show statistically significant disparities (Cohen's d ranging from small to large, all p_BH < 10^(-62)). Model choice is strongly associated with disparity magnitude: Qwen3 32B exhibits verbosity disparities 5.9x larger than LLaMA 3.3 70B. Two prompting-based mitigations show significant reductions in EFP disparity (78-95%) but no significant effect on stylistic dimensions, consistent with the hypothesis that stylistic explanation inequalities are encoded in pre-training distributions and are not resolvable through deployment-level instruction alone. A reproducible measurement framework is offered for explanation-level fairness auditing, with implications for AI regulation and deployment practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Explanation Fairness Taxonomy (EFT) with five dimensions (Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, Lexical Complexity Disparity) and two novel black-box metrics (Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP)). It reports an empirical study using 80 prompt templates across four decision domains (hiring, medical triage, credit assessment, legal judgment) and five LLMs, finding statistically significant disparities in all eight EFT metrics (Cohen's d from small to large, all p_BH < 10^{-62}) across up to 400 prompt pairs. Model choice affects disparity magnitude, and two prompting mitigations reduce EFP disparity substantially but not stylistic dimensions.

Significance. If the metrics validly capture explanation disparities attributable to demographics, the work would offer a useful auditing framework for LLM explanations with implications for regulation and deployment. Strengths include the scale of the study, inclusion of multiple open and closed models, and the reproducible measurement approach. However, the absence of metric validation limits the immediate significance for the AI fairness literature.

major comments (2)
  1. [§3 (Metric Definitions)] The Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP) are introduced as novel black-box heuristics without reported validation against human judgments, inter-rater reliability, or correlations with established measures. This is load-bearing for the central claim, as the reported statistically significant disparities (all p_BH < 10^{-62}) and Cohen's d values depend on these metrics correctly quantifying demographic effects on explanations rather than surface artifacts or prompt properties.
  2. [§4 (Experimental Setup)] The 80 prompt templates are asserted to isolate the demographic variable, but no ablation studies, controls, or analyses for potential confounds (e.g., prompt length, lexical overlap, syntactic structure, or domain-specific phrasing) are described. This undermines attribution of the observed disparities specifically to demographic mentions in the explanations.
minor comments (2)
  1. [Abstract and Results] The abstract and results refer to 'up to 400 prompt pairs' and 'eight EFT metrics' without an explicit table mapping the five taxonomy dimensions to the eight metrics or providing per-model/domain breakdowns.
  2. [Methods] Notation for the two novel metrics (HDS, EFP) could be clarified with explicit formulas or pseudocode in the methods to improve reproducibility.
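
As an illustration of the kind of pseudocode the second minor comment requests, here is one hypothetical reading of EFP as the share of sentences that reference the rendered decision. The paper's actual definition is not shown here and may differ.

```python
# Hypothetical Explanation Faithfulness Proxy: fraction of sentences that
# mention at least one decision-linked term (e.g. "rejected", "approved").
import re

def explanation_faithfulness_proxy(explanation: str, decision_terms: set[str]) -> float:
    """Share of sentences containing a decision term; 0.0 if no sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", explanation.strip()) if s]
    if not sentences:
        return 0.0
    linked = sum(1 for s in sentences
                 if any(term in s.lower() for term in decision_terms))
    return linked / len(sentences)
```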

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [§3 (Metric Definitions)] The Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP) are introduced as novel black-box heuristics without reported validation against human judgments, inter-rater reliability, or correlations with established measures. This is load-bearing for the central claim, as the reported statistically significant disparities (all p_BH < 10^{-62}) and Cohen's d values depend on these metrics correctly quantifying demographic effects on explanations rather than surface artifacts or prompt properties.

    Authors: We acknowledge that human validation would strengthen the metrics. HDS is derived from established linguistic markers of epistemic hedging documented in prior NLP and linguistics research, while EFP functions as a transparent proxy for decision-linked content by measuring explicit references to the model decision. The scale of the study (thousands of explanations) precluded full human annotation in this work. We have revised the manuscript to include expanded justification of the metrics with citations to supporting literature, a dedicated limitations subsection on the absence of human validation, and explicit plans for future human studies. The large effect sizes, statistical significance, and consistency across five models and four domains provide convergent evidence that the disparities are not merely surface artifacts. revision: partial

  2. Referee: [§4 (Experimental Setup)] The 80 prompt templates are asserted to isolate the demographic variable, but no ablation studies, controls, or analyses for potential confounds (e.g., prompt length, lexical overlap, syntactic structure, or domain-specific phrasing) are described. This undermines attribution of the observed disparities specifically to demographic mentions in the explanations.

    Authors: We agree that explicit controls for confounds are necessary to support attribution. In the revised version we have added analyses that control for prompt length, lexical overlap, and syntactic structure across the template set. We further include ablation results obtained by systematically varying prompt phrasing and domain-specific elements while holding the demographic variable fixed; the disparities remain statistically significant under these controls. These additional experiments and results are now reported in Section 4 and the appendix. revision: yes
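
A minimal sketch of the covariate adjustment this response describes: regress a per-pair disparity measure on a demographic indicator while adjusting for candidate confounds. The covariates, function name, and plain-OLS formulation are assumptions, not the authors' reported analysis.

```python
# Confound control via OLS: is the demographic indicator still predictive
# of the disparity after adjusting for prompt length and lexical overlap?
# All inputs are assumed to be 1-D NumPy arrays of equal length.
import numpy as np

def adjusted_demographic_effect(disparity, demo_flag, prompt_len, lex_overlap):
    """Return the OLS coefficient on the demographic indicator."""
    X = np.column_stack([
        np.ones_like(demo_flag, dtype=float),  # intercept
        demo_flag.astype(float),               # 1 if demographic mention present
        prompt_len.astype(float),              # confound: prompt token count
        lex_overlap.astype(float),             # confound: template lexical overlap
    ])
    beta, *_ = np.linalg.lstsq(X, disparity, rcond=None)
    return beta[1]  # effect attributable to the demographic mention
```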

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with externally defined metrics and no derivation chain

Full rationale

The paper presents an empirical study introducing the EFT taxonomy and two black-box metrics (HDS, EFP) applied to LLM outputs across fixed prompt templates and models. No equations, first-principles derivations, predictions from fitted parameters, or self-citations are invoked to support the central claims; the reported disparities are direct statistical comparisons of quantities measured on external LLMs and prompt sets. The metrics are operationally defined as heuristics without reducing to the disparities they quantify, and the analysis contains no self-referential steps that equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the validity of newly introduced metrics and the assumption that prompt-based demographic variation isolates explanation bias. No free parameters are fitted to data; the work relies on standard statistical tests.

axioms (1)
  • [standard math] Standard multiple-comparison correction (Benjamini-Hochberg) is appropriate for the reported p-values.
    Used to support the claim of statistical significance across eight metrics.
invented entities (3)
  • Explanation Fairness Taxonomy (EFT) [no independent evidence]
    purpose: Framework defining five operationalizable dimensions of explanation disparity
    Newly proposed in the paper to structure the empirical analysis.
  • Hedging Density Score (HDS) [no independent evidence]
    purpose: Black-box metric quantifying epistemic hedging disparity
    Novel metric introduced for one of the five dimensions.
  • Explanation Faithfulness Proxy (EFP) [no independent evidence]
    purpose: Heuristic indicator of decision-linked explanation variation
    New proxy metric introduced to measure one disparity dimension.

pith-pipeline@v0.9.0 · 5645 in / 1380 out tokens · 61798 ms · 2026-05-12T01:20:54.368625+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, "BBQ: A hand-built bias benchmark for question answering," in Findings of the Association for Computational Linguistics: ACL 2022, pp. 2086–2105, 2022.
  2. I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, "Bias and fairness in large language models: A survey," Computational Linguistics, vol. 50, no. 3, 2024.
  3. M. Turpin, J. Michael, E. Perez, and S. R. Bowman, "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
  4. C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, N. Schiefer, R. Soklaski, A. Tamkin, J. Kaplan, et al., "Sycophancy to subterfuge: Investigating reward tampering in language models," arXiv preprint arXiv:2406.10162, 2024.
  5. European Parliament and Council of the European Union, "Regulation (EU) 2024/1689 on artificial intelligence (AI Act)," tech. rep., Official Journal of the European Union, 2024.
  6. J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, "Gender bias in coreference resolution: Evaluation and debiasing methods," in Proceedings of NAACL-HLT 2018, pp. 15–20, 2018.
  7. E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams, "HolisticBias: A large-scale text corpus for equitable language," in Proceedings of EMNLP 2022, pp. 9180–9211, 2022.
  8. M. Nadeem, A. Bethke, and S. Reddy, "StereoSet: Measuring stereotypical bias in pretrained language models," in Proceedings of ACL 2021, pp. 5356–5371, 2021.
  9. N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, "CrowS-Pairs: A challenge dataset for measuring social biases in masked language models," in Proceedings of EMNLP 2020, pp. 1953–1967, 2020.
  10. J. D. Gaebler, S. Goel, A. Huq, and P. Tambe, "Auditing large language models for race & gender disparities: Implications for artificial intelligence–based hiring," Behavioral Science & Policy, 2025.
  11. B. C. Z. Tan and R. K.-W. Lee, "Unmasking implicit bias: Evaluating persona-prompted LLM responses in power-disparate social scenarios," in Proceedings of NAACL-HLT 2025, 2025.
  12. A. Amiri-Margavi, A. Gharagozlou, A. G. Davodi, S. P. M. Davoudi, and H. H. Balyani, "Equal access, unequal interaction: A counterfactual audit of LLM fairness," arXiv preprint arXiv:2602.02932, 2026.
  13. Z. Li, Z. Wang, et al., "Fairness in large language models: A taxonomic survey," arXiv preprint arXiv:2404.01349, 2024.
  14. T. Lanham, A. Chen, A. Radhakrishnan, N. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al., "Measuring faithfulness in chain-of-thought reasoning," arXiv preprint arXiv:2307.13702, 2023.
  15. M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Conmy, E. Durmus, J. Steinhardt, and E. Perez, "Towards understanding sycophancy in language models," in Findings of ACL 2024, 2024.
  16. A. Balagopalan, H. Zhang, K. Hamidieh, T. Hartvigsen, F. Rudzicz, and M. Ghassemi, "The road to explainability is paved with bias: Measuring the fairness of explanations," in Proceedings of FAccT 2022, pp. 1194–1206, 2022.
  17. J. Dai, S. Upadhyay, U. Aivodji, S. H. Bach, and H. Lakkaraju, "Fairness via explanation quality: Evaluating disparities in the quality of post hoc explanations," in Proceedings of AIES 2022, 2022.
  18. Y. Zhao, Y. Wang, and T. Derr, "Fairness and explainability: Bridging the gap towards fair model explanations," arXiv preprint arXiv:2212.03840, 2022.
  19. V. Mhasawade, S. Rahman, Z. Haskell-Craig, and R. Chunara, "Understanding disparities in post hoc machine learning explanation," in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), 2024.
  20. S. Wachter, B. Mittelstadt, and C. Russell, "The right to explanation in the AI Act," SSRN Electronic Journal, 2025. Available at SSRN: ssrn.com/abstract=5194301.
  21. N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using siamese BERT-networks," in Proceedings of EMNLP 2019, pp. 3982–3992, 2019.
  22. C. J. Hutto and E. Gilbert, "VADER: A parsimonious rule-based model for sentiment analysis of social media text," in Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM), 2014.
  23. J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom, "Derivation of new readability formulas for navy enlisted personnel," Tech. Rep. Research Branch Report 8-75, Naval Air Station Memphis, 1975.
  24. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988.
  25. Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: A practical and powerful approach to multiple testing," Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289–300, 1995.