Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA
Pith reviewed 2026-05-10 06:48 UTC · model grok-4.3
The pith
Social identity markers distort LLM accuracy and calibration on medical questions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that including social identity cues in medical QA prompts triggers a "calibration crisis" in LLMs: accuracy and uncertainty estimates shift systematically and non-uniformly depending on the specific marker, with "homosexual" descriptors consistently degrading performance and intersectional identities producing idiosyncratic, non-additive effects.
What carries the argument
Counterfactual variants of 2,364 medical questions that differ only by the addition of sexual orientation or religious markers, used to isolate effects on accuracy and calibration metrics across nine LLMs plus a clinician-validated open-ended case study.
Load-bearing premise
The counterfactual question variants cleanly isolate the causal effect of the identity markers without introducing other uncontrolled differences in wording, difficulty, or model training exposure.
What would settle it
Re-testing the same models on a new set of question pairs where wording length, semantic content, and phrasing difficulty are matched more tightly beyond the identity marker itself, and finding no systematic change in accuracy or calibration.
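To make the intended control concrete, here is a minimal sketch of marker-insertion counterfactuals with a token-level check that nothing but the marker changed; the base question and marker list are illustrative placeholders, not items from the paper's dataset.

```python
# Hypothetical illustration: each counterfactual variant should differ from
# its base question only by the inserted identity marker.
BASE = "A 45-year-old man presents with chest pain radiating to the left arm."
MARKERS = ["homosexual", "heterosexual", "Muslim", "atheist"]

def make_variant(base: str, marker: str) -> str:
    # Insert the marker as a patient descriptor; all other tokens are unchanged.
    return base.replace("45-year-old man", f"45-year-old {marker} man")

variants = {m: make_variant(BASE, m) for m in MARKERS}

# A clean pairing means the only token-level difference is the marker itself.
for marker, variant in variants.items():
    extra = set(variant.split()) - set(BASE.split())
    assert extra == {marker}, f"uncontrolled edit for {marker!r}: {extra}"
```

Passing this token-level check alone would not settle length, readability, or semantic confounds, which is why the tighter matching described above is the decisive test.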
Figures
Original abstract
Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants that insert sexual orientation or religious affiliation markers. It reports that these markers, especially 'homosexual', produce consistent drops in accuracy and uncertainty calibration, with intersectional identities yielding idiosyncratic, non-additive harms; a clinician-validated open-ended case study is presented to show the effect is not an artifact of multiple-choice format.
Significance. If the counterfactual controls are sound, the results would be significant for clinical LLM deployment: they indicate that identity cues can degrade the reliability of confidence signals, raising risks for equitable care and for any workflow that relies on model abstention or uncertainty thresholds.
Major comments (2)
- [Methods] Methods section on counterfactual construction: the central causal claim requires that each variant differs from its base solely by the inserted marker while preserving medical content, lexical difficulty, and syntactic structure. The manuscript must supply explicit post-generation balance checks (sentence length, Flesch readability, medical-term frequency, embedding cosine similarity) and the exact edit protocol; absent these, systematic confounds cannot be ruled out.
- [Results] Results, calibration and accuracy tables: the reported 'calibration crisis' and 'non-additive' intersectional effects rest on aggregate metrics across 2,364 items. The paper should report per-marker sample sizes, exact statistical tests (e.g., paired t-tests or Wilcoxon with correction), and confidence intervals; without them the 'consistently trigger' and 'idiosyncratic' claims remain difficult to evaluate for robustness.
Minor comments (2)
- [Abstract] Abstract: the phrase 'calibration crisis' is used without a quantitative definition (e.g., an ECE threshold or Brier-score delta); a brief operational definition would improve clarity (a sketch of one candidate definition follows this list).
- [Figures/Tables] Figure captions and table legends should explicitly state the number of questions per identity category and the exact calibration metric (ECE, MCE, or Brier) used.
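One candidate operational definition is expected calibration error (ECE): the sample-weighted average gap between accuracy and mean confidence within equal-width confidence bins. A minimal sketch, with the bin count and example inputs chosen purely for illustration:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    # Sample-weighted average |accuracy - mean confidence| over equal-width bins.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in the bin
    return float(ece)

# A model that is right 70% of the time but always reports 0.95 confidence:
print(expected_calibration_error([0.95] * 10, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))  # ~0.25
```

A "calibration crisis" claim could then be operationalized as, e.g., a marker-induced ECE increase exceeding a pre-registered threshold.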
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for strengthening the causal claims and statistical reporting in our work. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Methods] Methods section on counterfactual construction: the central causal claim requires that each variant differs from its base solely by the inserted marker while preserving medical content, lexical difficulty, and syntactic structure. The manuscript must supply explicit post-generation balance checks (sentence length, Flesch readability, medical-term frequency, embedding cosine similarity) and the exact edit protocol; absent these, systematic confounds cannot be ruled out.
Authors: We agree that quantitative balance checks are necessary to support the claim that variants differ only in the inserted social marker. The current manuscript describes the generation process at a high level but omits these metrics. In the revised version we will add an explicit subsection on counterfactual construction that details the edit protocol (template-based marker insertion followed by manual review for medical fidelity) and reports post-generation balance statistics: mean sentence lengths, Flesch-Kincaid readability scores, counts of medical terms drawn from a standard lexicon, and mean cosine similarity of sentence embeddings between each base question and its variants. These will appear in a new table or appendix so readers can directly evaluate potential confounds. revision: yes
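A minimal sketch of such balance checks, using crude self-contained stand-ins (a vowel-group heuristic for syllables in the Flesch-Kincaid formula, and bag-of-words cosine in place of embedding similarity); an actual pipeline would use a readability library, a medical lexicon, and sentence embeddings:

```python
import math
import re
from collections import Counter

def words(text: str) -> list:
    return re.findall(r"[A-Za-z']+", text)

def syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; a real check would use a syllabifier.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    ws = words(text)
    sentences = max(1, len(re.findall(r"[.!?]", text)))
    syl = sum(syllables(w) for w in ws)
    return 0.39 * len(ws) / sentences + 11.8 * syl / len(ws) - 15.59

def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine as a cheap stand-in for embedding cosine similarity.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values()))
    norm *= math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

base = "A 45-year-old man presents with chest pain radiating to the left arm."
variant = "A 45-year-old homosexual man presents with chest pain radiating to the left arm."
print(len(words(variant)) - len(words(base)))                      # length delta
print(flesch_kincaid_grade(variant) - flesch_kincaid_grade(base))  # readability delta
print(round(cosine_similarity(base, variant), 3))                  # near 1.0 for a clean edit
```

Reporting these deltas per marker, as the revision commits to, lets readers judge whether residual wording differences could drive the observed effects.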
- Referee: [Results] Results, calibration and accuracy tables: the reported 'calibration crisis' and 'non-additive' intersectional effects rest on aggregate metrics across 2,364 items. The paper should report per-marker sample sizes, exact statistical tests (e.g., paired t-tests or Wilcoxon with correction), and confidence intervals; without them the 'consistently trigger' and 'idiosyncratic' claims remain difficult to evaluate for robustness.
Authors: We accept that aggregate metrics alone make it harder to assess the robustness of the per-marker and intersectional patterns. The 2,364 items comprise base questions plus multiple counterfactual variants, but per-marker breakdowns and inferential statistics were not presented. In the revision we will add per-marker sample sizes to the results tables or a supplementary table, report exact paired tests (McNemar’s test for accuracy and appropriate calibration-difference tests) with multiplicity corrections, and include 95% confidence intervals for the key accuracy and calibration deltas. These additions will allow readers to evaluate the strength of the “consistently trigger” and “idiosyncratic” claims directly. revision: yes
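A minimal sketch of the committed paired accuracy test: McNemar's exact test, implemented as a binomial test on the discordant pairs (items whose correctness flips between base and variant); the correctness vectors here are illustrative.

```python
from scipy.stats import binomtest

def mcnemar_exact(base_correct, variant_correct) -> float:
    # p-value for the null that base->variant correctness flips are symmetric,
    # i.e. that the inserted marker has no effect on paired accuracy.
    b = sum(1 for x, y in zip(base_correct, variant_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, variant_correct) if not x and y)
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of an effect either way
    return binomtest(b, n=b + c, p=0.5).pvalue

# Illustrative: out of 50 paired items, 30 flip correct->incorrect after the
# marker is inserted and only 10 flip the other way.
base    = [1] * 40 + [0] * 10
variant = [1] * 10 + [0] * 30 + [1] * 10
print(mcnemar_exact(base, variant))  # ~0.002: a systematic accuracy drop
```

With one such test per marker and per model, a Holm or Benjamini-Hochberg correction would handle the multiplicity the referee raises.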
Circularity Check
No significant circularity; purely empirical evaluation
Full rationale
The paper conducts an empirical evaluation of LLMs on a fixed set of 2,364 medical questions and their counterfactual variants, measuring accuracy and calibration metrics directly from the models' outputs. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce the central claims to inputs by construction. The results are computed on externally sourced question sets across nine independently trained models, satisfying the criteria for a self-contained empirical study with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Counterfactual variants isolate the causal effect of identity markers on model behavior.
- Domain assumption: Performance on the chosen medical QA set is indicative of behavior in real clinical confidence-based workflows.