Recognition: 2 Lean theorem links
Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups
Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3
The pith
Large language models produce explanations of varying quality and style for identical decisions depending on the demographic groups mentioned.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across up to 400 prompt pairs in hiring, medical, credit, and legal domains, every metric in the five-dimensional Explanation Fairness Taxonomy exhibits statistically significant disparities, with Cohen's d values from small to large and p_BH below 10^(-62); model choice strongly modulates disparity magnitude, and two prompting mitigations cut decision-linked disparities by 78-95 percent while leaving stylistic dimensions unchanged.
What carries the argument
The Explanation Fairness Taxonomy (EFT) with its five dimensions—Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity—operationalized via the Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP) black-box metrics.
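The review does not reproduce the HDS formula. As a minimal sketch of what a hedging-density metric of this kind might look like, the marker list and per-100-token normalization below are assumptions for illustration, not the authors' definition:

```python
# Illustrative sketch only: the hedge-marker lexicon and the
# per-100-token normalization are assumed, not taken from the paper.
HEDGE_MARKERS = {
    "may", "might", "could", "perhaps", "possibly", "likely",
    "suggests", "appears", "seems", "somewhat", "arguably",
}

def hedging_density_score(explanation: str) -> float:
    """Count hedging markers per 100 tokens of an explanation."""
    tokens = explanation.lower().split()
    if not tokens:
        return 0.0
    hedges = sum(1 for t in tokens if t.strip(".,;:!?") in HEDGE_MARKERS)
    return 100.0 * hedges / len(tokens)
```

Any metric of this shape is only as good as its lexicon, which is exactly the validation gap the referee report below raises.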
If this is right
- Model choice can be used to reduce or increase the size of explanation disparities, with some models showing gaps several times larger than others.
- Prompting-based interventions can substantially shrink decision-linked explanation disparities.
- Stylistic disparities in verbosity, hedging, and lexical complexity remain after prompting, supporting the view that they originate in pre-training distributions.
- A reproducible measurement framework enables systematic auditing of explanation fairness before deployment.
- The results carry direct implications for regulatory requirements around AI decision explanations.
Where Pith is reading between the lines
- If users notice and react to these explanation differences, trust in AI-assisted decisions may vary systematically by demographic context even when the underlying decision is unchanged.
- Because prompting leaves stylistic gaps untouched, lasting fixes may require modifications to training data or model objectives rather than post-training instructions.
- The same measurement approach could be applied to other generative outputs, such as summaries or recommendations, to check for similar demographic-linked quality gaps.
Load-bearing premise
The selected prompt templates and black-box metrics isolate explanation differences caused by demographic mentions rather than other prompt wording or model artifacts.
What would settle it
Re-running the full set of experiments with prompt templates that hold demographic terms constant while varying other elements, or with alternative explanation-quality metrics, and finding no statistically significant disparities would undermine the central claim.
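A falsification run of this kind presupposes matched prompt pairs. A minimal sketch of how such pairs might be constructed, holding everything constant except the demographic mention (the template text and group phrases are invented for illustration; the paper's 80 templates are not reproduced here):

```python
# Hypothetical template for illustration; not one of the paper's templates.
TEMPLATE = ("The applicant, a {demo} software engineer with five years of "
            "experience, was denied the loan. Explain the decision.")

def make_prompt_pair(group_a: str, group_b: str) -> tuple[str, str]:
    """Return two prompts identical except for the demographic mention."""
    return TEMPLATE.format(demo=group_a), TEMPLATE.format(demo=group_b)
```

Pairing prompts this way is what licenses attributing any downstream explanation difference to the demographic phrase alone, which is why the load-bearing premise above matters.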
Original abstract
Large language models (LLMs) are increasingly deployed not only to make decisions but to explain them. While AI decision fairness has been studied extensively, the fairness of AI explanations (whether LLMs justify decisions with equal quality, depth, tone, and linguistic sophistication across demographic groups) has received little attention. This paper introduces the Explanation Fairness Taxonomy (EFT), a framework comprising five formally defined, operationalizable dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The taxonomy is instantiated in a controlled empirical study across 80 prompt templates, four consequential decision domains (hiring, medical triage, credit assessment, legal judgment), and five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B. Two novel black-box metrics are introduced: the Hedging Density Score (HDS) and the Explanation Faithfulness Proxy (EFP), a heuristic indicator of decision-linked explanation variation. Across up to 400 prompt pairs, all eight EFT metrics show statistically significant disparities (Cohen's d ranging from small to large, all p_BH < 10^(-62)). Model choice is strongly associated with disparity magnitude: Qwen3 32B exhibits verbosity disparities 5.9x larger than LLaMA 3.3 70B. Two prompting-based mitigations show significant reductions in EFP disparity (78-95%) but no significant effect on stylistic dimensions, consistent with the hypothesis that stylistic explanation inequalities are encoded in pre-training distributions and are not resolvable through deployment-level instruction alone. A reproducible measurement framework is offered for explanation-level fairness auditing, with implications for AI regulation and deployment practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Explanation Fairness Taxonomy (EFT) with five dimensions (Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, Lexical Complexity Disparity) and two novel black-box metrics (Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP)). It reports an empirical study using 80 prompt templates across four decision domains (hiring, medical triage, credit assessment, legal judgment) and five LLMs, finding statistically significant disparities in all eight EFT metrics (Cohen's d from small to large, all p_BH < 10^{-62}) across up to 400 prompt pairs. Model choice affects disparity magnitude, and two prompting mitigations reduce EFP disparity substantially but not stylistic dimensions.
Significance. If the metrics validly capture explanation disparities attributable to demographics, the work would offer a useful auditing framework for LLM explanations with implications for regulation and deployment. Strengths include the scale of the study, inclusion of multiple open and closed models, and the reproducible measurement approach. However, the absence of metric validation limits the immediate significance for the AI fairness literature.
major comments (2)
- [§3 (Metric Definitions)] The Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP) are introduced as novel black-box heuristics without reported validation against human judgments, inter-rater reliability, or correlations with established measures. This is load-bearing for the central claim, as the reported statistically significant disparities (all p_BH < 10^{-62}) and Cohen's d values depend on these metrics correctly quantifying demographic effects on explanations rather than surface artifacts or prompt properties.
- [§4 (Experimental Setup)] The 80 prompt templates are asserted to isolate the demographic variable, but no ablation studies, controls, or analyses of potential confounds (e.g., prompt length, lexical overlap, syntactic structure, or domain-specific phrasing) are described. This undermines attribution of the observed disparities specifically to demographic mentions in the explanations.
minor comments (2)
- [Abstract and Results] The abstract and results refer to 'up to 400 prompt pairs' and 'eight EFT metrics' without an explicit table mapping the five taxonomy dimensions to the eight metrics or providing per-model/domain breakdowns.
- [Methods] Notation for the two novel metrics (HDS, EFP) could be clarified with explicit formulas or pseudocode in the methods to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.
Point-by-point responses
- Referee: [§3 (Metric Definitions)] The Hedging Density Score (HDS) and Explanation Faithfulness Proxy (EFP) are introduced as novel black-box heuristics without reported validation against human judgments, inter-rater reliability, or correlations with established measures. This is load-bearing for the central claim, as the reported statistically significant disparities (all p_BH < 10^{-62}) and Cohen's d values depend on these metrics correctly quantifying demographic effects on explanations rather than surface artifacts or prompt properties.
Authors: We acknowledge that human validation would strengthen the metrics. HDS is derived from established linguistic markers of epistemic hedging documented in prior NLP and linguistics research, while EFP functions as a transparent proxy for decision-linked content by measuring explicit references to the model decision. The scale of the study (thousands of explanations) precluded full human annotation in this work. We have revised the manuscript to include expanded justification of the metrics with citations to supporting literature, a dedicated limitations subsection on the absence of human validation, and explicit plans for future human studies. The large effect sizes, statistical significance, and consistency across five models and four domains provide convergent evidence that the disparities are not merely surface artifacts. revision: partial
- Referee: [§4 (Experimental Setup)] The 80 prompt templates are asserted to isolate the demographic variable, but no ablation studies, controls, or analyses for potential confounds (e.g., prompt length, lexical overlap, syntactic structure, or domain-specific phrasing) are described. This undermines attribution of the observed disparities specifically to demographic mentions in the explanations.
Authors: We agree that explicit controls for confounds are necessary to support attribution. In the revised version we have added analyses that control for prompt length, lexical overlap, and syntactic structure across the template set. We further include ablation results obtained by systematically varying prompt phrasing and domain-specific elements while holding the demographic variable fixed; the disparities remain statistically significant under these controls. These additional experiments and results are now reported in Section 4 and the appendix. revision: yes
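One of the promised confound checks, lexical overlap between paired prompts, can be sketched as a token-level Jaccard similarity (an illustrative choice of overlap measure; the authors do not specify theirs):

```python
def jaccard_overlap(prompt_a: str, prompt_b: str) -> float:
    """Token-level Jaccard similarity: 0 = disjoint vocabularies, 1 = identical."""
    ta, tb = set(prompt_a.lower().split()), set(prompt_b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

For well-matched counterfactual pairs this value should sit near 1.0, with the residual difference attributable only to the swapped demographic phrase.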
Circularity Check
No circularity: purely empirical measurements with externally defined metrics and no derivation chain
full rationale
The paper presents an empirical study introducing the EFT taxonomy and two black-box metrics (HDS, EFP) applied to LLM outputs across fixed prompt templates and models. No equations, first-principles derivations, predictions from fitted parameters, or self-citations are invoked to support the central claims; the reported disparities are direct statistical comparisons of measured quantities against external LLMs and prompt sets. The metrics are operationally defined as heuristics without reducing to the disparities they quantify, and the analysis contains no self-referential steps that equate outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard multiple-comparison correction (Benjamini-Hochberg) is appropriate for the reported p-values.
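For reference, the two statistics this ledger and the core claim invoke can be sketched with standard textbook formulas (generic statistics code, not the paper's implementation):

```python
import math

def cohens_d(xs: list[float], ys: list[float]) -> float:
    """Cohen's d with pooled standard deviation (Cohen, 1988)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled

def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """BH step-up adjusted p-values (Benjamini & Hochberg, 1995)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)  # enforce monotonicity
        adjusted[i] = prev
    return adjusted
```

The paper's p_BH values are the output of a correction of this form applied across all metric comparisons.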
invented entities (3)
- Explanation Fairness Taxonomy (EFT): no independent evidence
- Hedging Density Score (HDS): no independent evidence
- Explanation Faithfulness Proxy (EFP): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "introduces the Explanation Fairness Taxonomy (EFT) ... Hedging Density Score (HDS) and the Explanation Faithfulness Proxy (EFP)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "all eight EFT metrics show statistically significant disparities (Cohen's d ... p_BH < 10^{-62})"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. R. Bowman, "BBQ: A hand-built bias benchmark for question answering," in Findings of the Association for Computational Linguistics: ACL 2022, pp. 2086-2105, 2022.
- [2] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, "Bias and fairness in large language models: A survey," Computational Linguistics, vol. 50, no. 3, 2024.
- [3] M. Turpin, J. Michael, E. Perez, and S. R. Bowman, "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
- [4] C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, N. Schiefer, R. Soklaski, A. Tamkin, J. Kaplan, et al., "Sycophancy to subterfuge: Investigating reward tampering in language models," arXiv preprint arXiv:2406.10162, 2024.
- [5] European Parliament and Council of the European Union, "Regulation (EU) 2024/1689 on artificial intelligence (AI Act)," tech. rep., Official Journal of the European Union, 2024.
- [6] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, "Gender bias in coreference resolution: Evaluation and debiasing methods," in Proceedings of NAACL-HLT 2018, pp. 15-20, 2018.
- [7] E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams, "HolisticBias: A large-scale text corpus for equitable language," in Proceedings of EMNLP 2022, pp. 9180-9211, 2022.
- [8] M. Nadeem, A. Bethke, and S. Reddy, "StereoSet: Measuring stereotypical bias in pretrained language models," in Proceedings of ACL 2021, pp. 5356-5371, 2021.
- [9] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, "CrowS-Pairs: A challenge dataset for measuring social biases in masked language models," in Proceedings of EMNLP 2020, pp. 1953-1967, 2020.
- [10] J. D. Gaebler, S. Goel, A. Huq, and P. Tambe, "Auditing large language models for race & gender disparities: Implications for artificial intelligence-based hiring," Behavioral Science & Policy, 2025.
- [11] B. C. Z. Tan and R. K.-W. Lee, "Unmasking implicit bias: Evaluating persona-prompted LLM responses in power-disparate social scenarios," in Proceedings of NAACL-HLT 2025, 2025.
- [12] A. Amiri-Margavi, A. Gharagozlou, A. G. Davodi, S. P. M. Davoudi, and H. H. Balyani, "Equal access, unequal interaction: A counterfactual audit of LLM fairness," arXiv preprint arXiv:2602.02932, 2026.
- [13] Z. Li, Z. Wang, et al., "Fairness in large language models: A taxonomic survey," arXiv preprint arXiv:2404.01349, 2024.
- [14] T. Lanham, A. Chen, A. Radhakrishnan, N. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al., "Measuring faithfulness in chain-of-thought reasoning," arXiv preprint arXiv:2307.13702, 2023.
- [15] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Conmy, E. Durmus, J. Steinhardt, and E. Perez, "Towards understanding sycophancy in language models," in Findings of ACL 2024, 2024.
- [16] A. Balagopalan, H. Zhang, K. Hamidieh, T. Hartvigsen, F. Rudzicz, and M. Ghassemi, "The road to explainability is paved with bias: Measuring the fairness of explanations," in Proceedings of FAccT 2022, pp. 1194-1206, 2022.
- [17] J. Dai, S. Upadhyay, U. Aivodji, S. H. Bach, and H. Lakkaraju, "Fairness via explanation quality: Evaluating disparities in the quality of post hoc explanations," in Proceedings of AIES 2022, 2022.
- [18] Y. Zhao, Y. Wang, and T. Derr, "Fairness and explainability: Bridging the gap towards fair model explanations," arXiv preprint arXiv:2212.03840, 2022.
- [19] V. Mhasawade, S. Rahman, Z. Haskell-Craig, and R. Chunara, "Understanding disparities in post hoc machine learning explanation," in Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), 2024.
- [20] S. Wachter, B. Mittelstadt, and C. Russell, "The right to explanation in the AI Act," SSRN Electronic Journal, 2025. Available at SSRN: ssrn.com/abstract=5194301.
- [21] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using siamese BERT-networks," in Proceedings of EMNLP 2019, pp. 3982-3992, 2019.
- [22] C. J. Hutto and E. Gilbert, "VADER: A parsimonious rule-based model for sentiment analysis of social media text," in Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM), 2014.
- [23] J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom, "Derivation of new readability formulas for navy enlisted personnel," Tech. Rep. Research Branch Report 8-75, Naval Air Station Memphis, 1975.
- [24] J. Cohen, Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, 2nd ed., 1988.
- [25] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: A practical and powerful approach to multiple testing," Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289-300, 1995.