Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

David Wingate; Sheryl Carty; Joshua Coates; Daniel Feldman; Nancy Fulda; Larry Howell; Brett Israelson; Dallin Jacobs; Jonathan Karr; John Paul Kimes; Elisabeth Kincaid; Paul Martens; Gavin Mobley; Suzana Pinheiro; Lindsay Slemboski; Peter Whiting

arxiv: 2605.24319 · v1 · pith:ELB5NCKFnew · submitted 2026-05-23 · 💻 cs.LG

Pith reviewed 2026-06-30 14:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords: omissive bias · religious representation · LLM evaluation · ethical decision-making · value alignment · AllFaith benchmark · AI bias


The pith

LLMs omit religious perspectives in answers to everyday ethical questions more than humans expect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models draw on religious frameworks when responding to personal ethical dilemmas such as grief, family conflict, or addiction. It introduces the AllFaith benchmark of 150 questions sourced from chat transcripts and faith communities, then evaluates 27 models against a human survey baseline. Models consistently mention religion less often than survey participants expect, with a larger gap for concrete practical situations than for abstract existential ones. The authors call this pattern omissive bias: the systematic absence of religious representation in value-laden responses. They conclude that current models overlook opportunities to reflect frameworks many people use for personal decisions.

Core claim

When posed 150 ethically salient questions that are not explicitly about religion, 27 evaluated LLMs invoke religion, religious practices, or religious leaders at rates below those expected by human survey participants. The underrepresentation is asymmetric: models include religious content more often on abstract topics like meaning or death and less often on practical personal matters like marriage, addiction, or grief. The AllFaith benchmark measures this omission through an LLM-as-judge rubric that awards credit for any relevant mention, establishing omissive bias as a distinct dimension of value alignment.

What carries the argument

The AllFaith Religious Representation Benchmark, a set of 150 open-ended ethical questions paired with an LLM-as-judge rubric that credits any mention of religion.
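
As a rough illustration of how this measurement fits together, the sketch below turns binary judge verdicts into per-category invocation rates and subtracts them from human-expected rates. It assumes the judge has already produced a True/False verdict for each response. The function names, the category labels `abstract` and `practical`, and every number are hypothetical placeholders, not code or results from the paper.

```python
from collections import defaultdict


def invocation_rates(verdicts):
    """Fraction of responses per category that the judge credited.

    verdicts: iterable of (category, mentioned) pairs, where `mentioned`
    is True if the response contained any religious mention.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [mentions, total]
    for category, mentioned in verdicts:
        counts[category][0] += int(mentioned)
        counts[category][1] += 1
    return {cat: hits / total for cat, (hits, total) in counts.items()}


def omission_gap(model_rates, expected_rates):
    """Expected rate minus model rate per category (positive = omission)."""
    return {cat: expected_rates[cat] - model_rates[cat] for cat in model_rates}


# Toy, made-up verdicts and expectations, for illustration only.
toy_verdicts = [
    ("abstract", True), ("abstract", True),
    ("abstract", False), ("abstract", False),
    ("practical", True), ("practical", False),
    ("practical", False), ("practical", False),
]
toy_expected = {"abstract": 0.75, "practical": 0.75}

rates = invocation_rates(toy_verdicts)
gaps = omission_gap(rates, toy_expected)
print(rates)
print(gaps)
```

In this toy data the gap is larger for the practical category, which is the shape of the asymmetry the paper reports; the magnitudes here carry no meaning.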

If this is right

Models deliver responses that overlook religious frameworks in the practical situations where many users most rely on them.
The asymmetry suggests that training data or alignment processes make models more willing to surface religious content in abstract discussion than in concrete personal guidance.
Users seeking advice on grief or family issues receive outputs less aligned with the full range of human value systems.
The benchmark provides a repeatable method to track whether future models close the gap with human expectations on religious representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the asymmetry persists, religious users may turn to other sources for advice on daily challenges, reducing AI utility in those domains.
The result suggests alignment techniques could be extended to explicitly include diverse cultural and religious worldviews rather than relying on omission.
Developers might test whether fine-tuning on faith-community transcripts increases religious invocation rates without degrading other performance metrics.
The benchmark could be adapted to measure omission of other cultural frameworks such as philosophical traditions or indigenous knowledge systems.

Load-bearing premise

The 150 questions represent situations where religious perspectives are commonly valued by users and the LLM-as-judge rubric plus human survey accurately measure expected religious representation.

What would settle it

A new human survey of the same model responses to the 150 questions finds no statistically significant difference between LLM invocation rates and participant expectations.
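
One way such a comparison could be run is a pooled two-proportion z-test, sketched below using only the standard library. The choice of test is an assumption made for illustration; the paper's actual analysis is not described here, and because the same 150 questions underlie both rates, a paired test might be more appropriate. The function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided pooled z-test for a difference between two proportions.

    Returns (z, p_value). Assumes independent samples, which is a
    simplification for a benchmark where both rates share questions.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 30 of 150 model responses mention religion,
# while 75 of 150 survey judgments expect a mention.
z, p = two_proportion_z_test(30, 150, 75, 150)
print(z, p)
```

A large p-value on real data would be the "no difference" outcome described above; a small one would be consistent with the reported underrepresentation.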

Original abstract

As large language models become a default source of guidance on personal, moral, and existential questions, it matters whether they draw on the religious frameworks that have historically shaped such reasoning, or systematically omit them. In this paper, we ask a deliberately narrow question: when posed an everyday ethical question for which religious perspectives may be valuable, do LLMs invoke religion at all? In contrast to benchmarks that look for the presence of political leanings or social bias, we look for the absence of religious representation as a dimension of value alignment and bias in LLMs. We term this "omissive bias." To measure omissive bias, we contribute the AllFaith Religious Representation Benchmark: 150 ethically and personally salient questions, sourced from in-the-wild chat transcripts and faith-community contributors, paired with an LLM-as-judge rubric that gives full credit for any mention of a religion, a religious practice, or a religious leader. The questions are not themselves about religion; they are open-ended questions about grief, forgiveness, relationships, purpose, and honesty, where religion is one valuable perspective among several. We also run a human-subjects survey to compare LLM behavior against human expectations. Evaluating 27 models, we find that LLMs consistently underrepresent religion relative to human expectations. The omission is asymmetric: models invoke religion more readily for abstract existential questions (meaning, death, truth) than for the practical personal situations (grief, marriage, family conflict, addiction) where many people most rely on it. It is not our purpose to adjudicate which values LLMs should hold. We argue, more modestly, that current LLM responses overlook critical opportunities to reflect religious frameworks that many people draw on when navigating personal and ethical challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a benchmark showing LLMs omit religious references in ethical advice more than humans expect, with a practical-vs-abstract asymmetry, but the question set's representativeness is the main untested piece.


The one thing to know is that this work measures omission of religion in LLM answers to non-religious ethical questions and reports that 27 models invoke religion less often than a human survey baseline would predict, especially on practical topics like grief or family conflict rather than abstract ones.

What is new is the AllFaith benchmark itself: 150 open-ended questions drawn from chat transcripts and faith-community input, scored by an LLM judge that gives credit for any mention of religion, practice, or leader. Pairing that with a human expectation survey is a clean way to create an external reference point instead of just counting outputs. Running the same set across many models makes the pattern visible at scale, and the paper keeps its claim narrow by not arguing what the right level of religious content should be.

The execution details that matter most are still light. The sourcing method for the 150 questions is described only at a high level, so it is not yet clear how well they represent the situations where users actually turn to LLMs for advice, or how much selection bias exists toward contexts already friendly to religious framing. The split into abstract versus practical questions drives the asymmetry result, but without reported inter-rater checks or pre-registration, a different classification could shift the finding. The LLM judge rubric is simple, yet no agreement numbers or error analysis appear in the available description.

This is aimed at people working on value alignment, conversational agents for personal domains, or bias measurement more generally. The idea is direct enough and the data collection effort is real, so it deserves a serious referee to examine the methodology and results rather than a desk reject.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the AllFaith Religious Representation Benchmark consisting of 150 ethically and personally salient questions (sourced from in-the-wild chat transcripts and faith-community contributors) to measure whether LLMs invoke religious perspectives when responding to everyday ethical questions where religion is one relevant framework among others. An LLM-as-judge rubric awards credit for any mention of religion, practice, or leader. The authors also conduct a human-subjects survey to establish an external baseline and evaluate 27 models, claiming consistent underrepresentation of religion relative to human expectations, with an asymmetric pattern (more invocation on abstract existential questions than on practical personal ones such as grief or family conflict).

Significance. If the benchmark construction and human baseline prove robust, the work identifies a previously under-examined dimension of value alignment (omissive bias with respect to religious frameworks) that affects a large user population. The asymmetry result, if replicable, would point to concrete failure modes in how current models handle personal versus existential queries and supply a reusable benchmark for future mitigation studies.

major comments (3)

[Benchmark description] AllFaith benchmark construction: the central underrepresentation claim requires that the 150 questions are representative of situations in which religious perspectives are commonly valued by users, yet the sourcing method (chat transcripts plus faith-community contributors) is described only at a high level with no explicit selection criteria, bias checks, or validation that the set is not already tilted toward religiously salient contexts.
[Human-subjects survey] Human survey baseline: the comparison to 'human expectations' is load-bearing for the headline result, but the abstract supplies no information on sample size, participant demographics, recruitment, quantification of expectations, or inter-rater reliability for the abstract-vs-practical classification; without these the measured gap cannot be interpreted.
[Results and analysis] Asymmetry claim: the reported difference between abstract existential and practical personal questions depends on a stable, reliable partition of the 150 items, yet no pre-registration, classification protocol, or agreement statistics are mentioned, directly undermining the asymmetry finding.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one concrete numerical result (e.g., average invocation rate or effect size) to convey the magnitude of the reported omission.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the AllFaith benchmark, human survey, and asymmetry analysis. We will revise the manuscript to provide greater methodological transparency while maintaining the core findings.


Referee: [Benchmark description] AllFaith benchmark construction: the central underrepresentation claim requires that the 150 questions are representative of situations in which religious perspectives are commonly valued by users, yet the sourcing method (chat transcripts plus faith-community contributors) is described only at a high level with no explicit selection criteria, bias checks, or validation that the set is not already tilted toward religiously salient contexts.

Authors: We agree that the current description is high-level and will strengthen it in revision. The revised manuscript will include a detailed 'Benchmark Construction' subsection specifying: selection criteria for chat transcripts (e.g., filtering for personal/ethical topics from public sources with privacy considerations), contributor instructions for faith communities to suggest questions where religion is relevant but not central, bias checks such as diversity in question topics and avoidance of leading phrasing, and validation through review by an independent panel or pilot user study to confirm representativeness. This addresses the concern directly.

revision: yes

Referee: [Human-subjects survey] Human survey baseline: the comparison to 'human expectations' is load-bearing for the headline result, but the abstract supplies no information on sample size, participant demographics, recruitment, quantification of expectations, or inter-rater reliability for the abstract-vs-practical classification; without these the measured gap cannot be interpreted.

Authors: The abstract indeed omits these details for brevity. The full manuscript's methods section describes the survey, but to improve accessibility, we will update the abstract with a concise summary of key parameters and expand the main text with full information on sample size, demographics, recruitment platform and strategy, how expectations were quantified (e.g., proportion of humans expecting religious content in responses), and any reliability measures for the abstract vs practical classification. We will also report limitations if certain metrics like inter-rater reliability were not computed.

revision: yes

Referee: [Results and analysis] Asymmetry claim: the reported difference between abstract existential and practical personal questions depends on a stable, reliable partition of the 150 items, yet no pre-registration, classification protocol, or agreement statistics are mentioned, directly undermining the asymmetry finding.

Authors: We recognize that transparency on the classification is essential. In the revision, we will add a description of the classification protocol (criteria for 'abstract existential' vs 'practical personal'), report agreement statistics from multiple independent coders (e.g., Cohen's kappa), include the full categorized question list in the appendix or supplementary materials, and acknowledge the lack of pre-registration as a limitation of the study. These additions will allow readers to evaluate the stability of the asymmetry result.

revision: yes
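
For readers unfamiliar with the agreement statistic named in the last response, the sketch below computes Cohen's kappa for two coders who each label the same items. It is a minimal illustration under assumed inputs: the function name, the two label strings, and the toy labels are hypothetical, and nothing here reflects the authors' actual coding data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each coder's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy labels for four questions, for illustration only.
coder_a = ["abstract", "abstract", "practical", "practical"]
coder_b = ["abstract", "practical", "practical", "practical"]
kappa = cohens_kappa(coder_a, coder_b)
print(kappa)
```

Here the coders agree on three of four items, but half of that agreement is expected by chance given their label frequencies, so kappa is lower than the raw agreement rate.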

Circularity Check

0 steps flagged

No circularity: benchmark and human survey are independent external references with no self-referential reduction.


The paper constructs the AllFaith benchmark from in-the-wild transcripts and faith-community contributors, applies an LLM-as-judge rubric that credits any religious mention, and compares outputs to a separate human-subjects survey. The central claim of underrepresentation and asymmetry is measured against these external baselines rather than derived from fitted parameters, self-citations, or definitions that presuppose the result. No equations, uniqueness theorems, or ansatzes appear; the derivation chain does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no methodological details on free parameters, axioms, or invented entities are available.

pith-pipeline@v0.9.1-grok · 5904 in / 1066 out tokens · 31787 ms · 2026-06-30T14:00:40.954344+00:00 · methodology



[1] [2]

Navigating the Shift: A Comparative Analysis of Web Search and Generative AI Response Generation

URLhttps://arxiv.org/abs/2601.16858. Long Ouyang, Jeffrey Wu, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [4]

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Samuel Gehman, Suchin Gururangan, et al. Realtoxicityprompts: Evaluating neural toxic degeneration in language models.arXiv preprint arXiv:2009.11462,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[4] [5]

URLhttps://doi.org/10.1038/s44387-025-00048-0

doi: 10.1038/s44387-025-00048-0. URLhttps://doi.org/10.1038/s44387-025-00048-0. Pew Research Center. The global religious landscape: A report on the size and distribution of the world’s major religious groups as of

work page doi:10.1038/s44387-025-00048-0

[5] [6]

George Gerbner and Larry Gross

URLhttps://www.pewresearch.org/religion/2012/12/18/ global-religious-landscape-exec/. George Gerbner and Larry Gross. Living with television: The violence profile.Journal of Communication, 26(2):172–199,

2012

[6] [7]

21 Kate Crawford

doi: 10.1111/j.1460-2466.1976.tb01397.x. 21 Kate Crawford. The trouble with bias,

work page doi:10.1111/j.1460-2466.1976.tb01397.x 1976

[7] [8]

Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff Phillips, and Kai-Wei Chang

doi: 10.18653/v1/2020.acl-main.485. Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff Phillips, and Kai-Wei Chang. Harms of gender exclusivity and challenges in non-binary representation in language technologies. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1968–1994,

work page doi:10.18653/v1/2020.acl-main.485 2020

[8] [9]

Sunipa Dev, Emily Sheng, Jieyu Zhao, Aubrie Amstutz, Jiao Sun, Yu Hou, Mattie Sanseverino, Jiin Kim, Akihiro Nishi, Nanyun Peng, and Kai-Wei Chang

doi: 10.18653/v1/2021.emnlp-main.150. Sunipa Dev, Emily Sheng, Jieyu Zhao, Aubrie Amstutz, Jiao Sun, Yu Hou, Mattie Sanseverino, Jiin Kim, Akihiro Nishi, Nanyun Peng, and Kai-Wei Chang. On measures of biases and harms in NLP. InFindings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 246–267,

work page doi:10.18653/v1/2021.emnlp-main.150 2021

[9] [10]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.arXiv preprint arXiv:2306.05685,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [11]

URLhttps://arxiv.org/abs/2406.07791. OpenAI. Model spec.https://model-spec.openai.com/2025-10-27.html, October

work page arXiv 2025

[11] [12]

First released May 8,

Version 2025-10-27. First released May 8,

2025

[12] [13]

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R

doi: 10.1162/coli_a_00524. Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967,

work page doi:10.1162/coli_a_00524 2020

[13] [14]

C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

doi: 10.18653/v1/2020.emnlp-main.154. Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ:Ahand-builtbiasbenchmarkforquestionanswering. InFindingsoftheAssociationforComputationalLinguistics: ACL 2022, pages 2086–2105,

work page doi:10.18653/v1/2020.emnlp-main.154 2020

[14] [15]

BBQ : A hand-built bias benchmark for question answering

doi: 10.18653/v1/2022.findings-acl.165. Jwala Dhamala, Tony Sun, et al. Bold: Dataset and metrics for measuring biases in open-ended language generation.Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,

work page doi:10.18653/v1/2022.findings-acl.165 2022

[15] [16]

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

URLhttps://arxiv.org/abs/2510.16380. Michael J. Ryan, William Held, and Diyi Yang. Unintended impacts of LLM alignment on global representation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [17]

acl-long.853/

URLhttps://aclanthology.org/2024. acl-long.853/. Gaye Tuchman. Introduction: The symbolic annihilation of women by the mass media. In Gaye Tuchman, Arlene Kaplan Daniels, and James Benét, editors,Hearth and Home: Images of Women in the Mass Media, pages 3–38. Oxford University Press, New York,

2024

[17] [18]

Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, and Danish Pruthi

doi: 10.1007/s00799-018-0261-y. Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, and Danish Pruthi. Geographical erasure in language generation. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 12310–12324,

work page doi:10.1007/s00799-018-0261-y 2023

[18] [19]

AgrimaSeth,MonojitChoudhary,SunayanaSitaram,KentaroToyama,AdityaVashistha,andKalikaBali

doi: 10.18653/v1/2023.findings-emnlp.823. AgrimaSeth,MonojitChoudhary,SunayanaSitaram,KentaroToyama,AdityaVashistha,andKalikaBali. Howdeepisrepresentational bias in LLMs? the cases of caste and religion.arXiv preprint arXiv:2508.03712,

work page doi:10.18653/v1/2023.findings-emnlp.823 2023

[19] [20]

doi: 10.1038/s41467-025-68004-9

ISSN 2041-1723. doi: 10.1038/s41467-025-68004-9. URLhttp://dx.doi.org/10.1038/s41467-025-68004-9. AdelKhorramrouzandSharonLevy. Characterizingselectiverefusalbiasinlargelanguagemodels.arXivpreprintarXiv:2510.27087,


[20] [21]

Towards Measuring the Representation of Subjective Global Opinions in Language Models

doi: 10.1073/pnas.2412015122. Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. Towards measuring the representation of subjective g...


[21] [22]

doi: 10.18653/v1/2024.acl-long.862. Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. Investigating cultural alignment of large language models. arXiv preprint arXiv:2402.13231,


[22] [23]

Which humans?

doi: 10.31234/osf.io/5b26t. Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural...


[23] [24]

Ronald Fischer, Markus Luczak-Roesch, and Johannes A. Karl. What does ChatGPT return about human values? Exploring value bias in ChatGPT using a descriptive value theory. arXiv preprint arXiv:2304.03612,


[24] [25]

Persistent anti-muslim bias in large language models

Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-Muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 298–306,

2021

[25] [26]

Babak Hemmatian, Razan Baltaji, and Lav R. Varshney. Muslim-violence bias persists in debiased GPT models. arXiv preprint arXiv:2310.18368,


[26] [27]

Divine LLaMAs: Bias, stereotypes, stigmatization, and emotion representation of religion in large language models

Flor Miriam Plaza-del Arco, Amanda Cercas Curry, Susanna Paoli, Alba Curry, and Dirk Hovy. Divine LLaMAs: Bias, stereotypes, stigmatization, and emotion representation of religion in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024,

2024

[27] [28]

URL https://aclanthology.org/2024.findings-emnlp.251/. Khyati Khandelwal, Manuel Tonneau, Andrew M. Bean, Hannah Rose Kirk, and Scott A. Hale. Indian-BhED: A dataset for measuring India-centric biases in large language models. In Proceedings of the 2024 International Conference on Information Technology for Social Good (GoodIT),

2024

[28] [29]

Western, religious or spiritual: An evaluation of moral justification in large language models

Eyup Engin Kucuk and Muhammed Yusuf Kocyigit. Western, religious or spiritual: An evaluation of moral justification in large language models. arXiv preprint arXiv:2311.07792,


[29] [30]

Religious bias in LLMs is significantly understudied

Walter Reade, Sheryl Carty, and Brett Israelson. Religious bias in LLMs is significantly understudied. arXiv preprint arXiv:2605.4242,


[30] [31]

URL https://arxiv.org/abs/2506.00643. Steffen Herbold. Sortbench: Benchmarking LLMs based on their ability to sort lists, 2025. URL https://arxiv.org/abs/2504.08312. Yusuke Yamauchi, Taro Yano, and Masafumi Oyamada. An empirical study of LLM-as-a-judge: How design choices impact evaluation reliability,


[31] [32]

URL https://arxiv.org/abs/2506.13639. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations,
