Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making
Pith reviewed 2026-06-30 14:00 UTC · model grok-4.3
The pith
LLMs omit religious perspectives in answers to everyday ethical questions more than humans expect.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When posed 150 ethically salient questions that are not explicitly about religion, 27 evaluated LLMs invoke religion, religious practices, or religious leaders at rates below those expected by human survey participants. The underrepresentation is asymmetric: models include religious content more often on abstract topics like meaning or death and less often on practical personal matters like marriage, addiction, or grief. The AllFaith benchmark measures this omission through an LLM-as-judge rubric that awards credit for any relevant mention, establishing omissive bias as a distinct dimension of value alignment.
What carries the argument
The AllFaith Religious Representation Benchmark, a set of 150 open-ended ethical questions paired with an LLM-as-judge rubric that credits any mention of religion.
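As a rough illustration of how such a rubric turns into the reported numbers, the sketch below aggregates binary judge verdicts into per-category invocation rates and an abstract-minus-practical gap. The verdicts, question IDs, and category labels are made-up assumptions for illustration; they are not data from the paper, and the paper's own aggregation may differ.

```python
from collections import defaultdict

# Hypothetical judge verdicts: (question_id, category, mentions_religion).
# In the benchmark these would come from the LLM-as-judge rubric;
# the values here are invented purely for illustration.
verdicts = [
    ("q1", "abstract", True),
    ("q2", "abstract", True),
    ("q3", "abstract", False),
    ("q4", "practical", False),
    ("q5", "practical", True),
    ("q6", "practical", False),
    ("q7", "practical", False),
]


def invocation_rates(rows):
    """Return the fraction of responses credited by the rubric, per category."""
    credited = defaultdict(int)
    total = defaultdict(int)
    for _qid, category, mentions_religion in rows:
        total[category] += 1
        credited[category] += int(mentions_religion)
    return {cat: credited[cat] / total[cat] for cat in total}


rates = invocation_rates(verdicts)
# A positive gap means religion is invoked more often on abstract questions.
asymmetry = rates["abstract"] - rates["practical"]
print(rates, asymmetry)
```

Because the rubric credits any mention, each verdict is a single boolean, so the rate is simply a mean over questions within a category.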
If this is right
- Models deliver responses that overlook religious frameworks in the practical situations where many users most rely on them.
- The asymmetry suggests that training data or alignment processes make models more willing to raise religion in abstract existential discussion than in concrete personal guidance.
- Users seeking advice on grief or family issues receive outputs less aligned with the full range of human value systems.
- The benchmark provides a repeatable method to track whether future models close the gap with human expectations on religious representation.
Where Pith is reading between the lines
- If the asymmetry persists, religious users may turn to other sources for advice on daily challenges, reducing AI utility in those domains.
- The result suggests alignment techniques could be extended to explicitly include diverse cultural and religious worldviews rather than relying on omission.
- Developers might test whether fine-tuning on faith-community transcripts increases religious invocation rates without degrading other performance metrics.
- The benchmark could be adapted to measure omission of other cultural frameworks such as philosophical traditions or indigenous knowledge systems.
Load-bearing premise
The 150 questions represent situations where religious perspectives are commonly valued by users and the LLM-as-judge rubric plus human survey accurately measure expected religious representation.
What would settle it
A new human survey rating the same model responses to the 150 questions that finds no statistically significant difference between LLM invocation rates and participant expectations.
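One way to make "no statistically significant difference" concrete is a paired test over questions. The sketch below runs an exact two-sided sign-flip permutation test on per-question differences between a human expectation rate and a model invocation rate. All numbers are hypothetical, the function name is invented here, and the paper does not state which test it uses, so this is an assumption rather than the authors' method.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each difference is arbitrary,
    so every sign assignment is enumerated and the share of assignments
    whose |sum| is at least the observed |sum| is returned.
    """
    n = len(diffs)
    observed = abs(sum(diffs))
    extreme = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / 2 ** n


# Hypothetical per-question rates (8 questions, not the paper's data).
human_expected = [0.6, 0.5, 0.7, 0.4, 0.55, 0.65, 0.5, 0.45]
model_rate = [0.2, 0.3, 0.3, 0.1, 0.25, 0.35, 0.2, 0.3]

diffs = [h - m for h, m in zip(human_expected, model_rate)]
p_value = sign_flip_p_value(diffs)
print(p_value)
```

Exact enumeration is only feasible for a handful of items. For all 150 questions, one would sample random sign assignments instead of enumerating 2^150 of them.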
Original abstract
As large language models become a default source of guidance on personal, moral, and existential questions, it matters whether they draw on the religious frameworks that have historically shaped such reasoning, or systematically omit them. In this paper, we ask a deliberately narrow question: when posed an everyday ethical question for which religious perspectives may be valuable, do LLMs invoke religion at all? In contrast to benchmarks that look for the presence of political leanings or social bias, we look for the absence of religious representation as a dimension of value alignment and bias in LLMs. We term this "omissive bias." To measure omissive bias, we contribute the AllFaith Religious Representation Benchmark: 150 ethically and personally salient questions, sourced from in-the-wild chat transcripts and faith-community contributors, paired with an LLM-as-judge rubric that gives full credit for any mention of a religion, a religious practice, or a religious leader. The questions are not themselves about religion; they are open-ended questions about grief, forgiveness, relationships, purpose, and honesty, where religion is one valuable perspective among several. We also run a human-subjects survey to compare LLM behavior against human expectations. Evaluating 27 models, we find that LLMs consistently underrepresent religion relative to human expectations. The omission is asymmetric: models invoke religion more readily for abstract existential questions (meaning, death, truth) than for the practical personal situations (grief, marriage, family conflict, addiction) where many people most rely on it. It is not our purpose to adjudicate which values LLMs should hold. We argue, more modestly, that current LLM responses overlook critical opportunities to reflect religious frameworks that many people draw on when navigating personal and ethical challenges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the AllFaith Religious Representation Benchmark consisting of 150 ethically and personally salient questions (sourced from in-the-wild chat transcripts and faith-community contributors) to measure whether LLMs invoke religious perspectives when responding to everyday ethical questions where religion is one relevant framework among others. An LLM-as-judge rubric awards credit for any mention of religion, practice, or leader. The authors also conduct a human-subjects survey to establish an external baseline and evaluate 27 models, claiming consistent underrepresentation of religion relative to human expectations, with an asymmetric pattern (more invocation on abstract existential questions than on practical personal ones such as grief or family conflict).
Significance. If the benchmark construction and human baseline prove robust, the work identifies a previously under-examined dimension of value alignment (omissive bias toward religious frameworks) that affects a large user population. The asymmetry result, if replicable, would point to concrete failure modes in how current models handle personal versus existential queries and supply a reusable benchmark for future mitigation studies.
major comments (3)
- [Benchmark description] AllFaith benchmark construction: the central underrepresentation claim requires that the 150 questions are representative of situations in which religious perspectives are commonly valued by users, yet the sourcing method (chat transcripts plus faith-community contributors) is described only at a high level with no explicit selection criteria, bias checks, or validation that the set is not already tilted toward religiously salient contexts.
- [Human-subjects survey] Human survey baseline: the comparison to 'human expectations' is load-bearing for the headline result, but the abstract supplies no information on sample size, participant demographics, recruitment, quantification of expectations, or inter-rater reliability for the abstract-vs-practical classification; without these the measured gap cannot be interpreted.
- [Results and analysis] Asymmetry claim: the reported difference between abstract existential and practical personal questions depends on a stable, reliable partition of the 150 items, yet no pre-registration, classification protocol, or agreement statistics are mentioned, directly undermining the asymmetry finding.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one concrete numerical result (e.g., average invocation rate or effect size) to convey the magnitude of the reported omission.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the AllFaith benchmark, human survey, and asymmetry analysis. We will revise the manuscript to provide greater methodological transparency while maintaining the core findings.
Point-by-point responses
- Referee: [Benchmark description] AllFaith benchmark construction: the central underrepresentation claim requires that the 150 questions are representative of situations in which religious perspectives are commonly valued by users, yet the sourcing method (chat transcripts plus faith-community contributors) is described only at a high level with no explicit selection criteria, bias checks, or validation that the set is not already tilted toward religiously salient contexts.
  Authors: We agree that the current description is high-level and will strengthen it in revision. The revised manuscript will include a detailed 'Benchmark Construction' subsection specifying: selection criteria for chat transcripts (e.g., filtering for personal/ethical topics from public sources with privacy considerations), contributor instructions for faith communities to suggest questions where religion is relevant but not central, bias checks such as diversity in question topics and avoidance of leading phrasing, and validation through review by an independent panel or pilot user study to confirm representativeness. This addresses the concern directly.
  Revision: yes
- Referee: [Human-subjects survey] Human survey baseline: the comparison to 'human expectations' is load-bearing for the headline result, but the abstract supplies no information on sample size, participant demographics, recruitment, quantification of expectations, or inter-rater reliability for the abstract-vs-practical classification; without these the measured gap cannot be interpreted.
  Authors: The abstract indeed omits these details for brevity. The full manuscript's methods section describes the survey, but to improve accessibility, we will update the abstract with a concise summary of key parameters and expand the main text with full information on sample size, demographics, recruitment platform and strategy, how expectations were quantified (e.g., proportion of humans expecting religious content in responses), and any reliability measures for the abstract-vs-practical classification. We will also report limitations if certain metrics like inter-rater reliability were not computed.
  Revision: yes
- Referee: [Results and analysis] Asymmetry claim: the reported difference between abstract existential and practical personal questions depends on a stable, reliable partition of the 150 items, yet no pre-registration, classification protocol, or agreement statistics are mentioned, directly undermining the asymmetry finding.
  Authors: We recognize that transparency on the classification is essential. In the revision, we will add a description of the classification protocol (criteria for 'abstract existential' vs 'practical personal'), report agreement statistics from multiple independent coders (e.g., Cohen's kappa), include the full categorized question list in the appendix or supplementary materials, and acknowledge the lack of pre-registration as a limitation of the study. These additions will allow readers to evaluate the stability of the asymmetry result.
  Revision: yes
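The agreement statistic named in the last response, Cohen's kappa, can be computed directly from two coders' labels. The sketch below is a minimal pure-Python version applied to invented labels for eight questions. The coder labels and the function name are assumptions for illustration and do not reflect the paper's actual coding data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each coder's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is
        # formally undefined, and returning 1.0 is a convention chosen here.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two coders (not the paper's data).
coder_a = ["abstract", "abstract", "practical", "practical",
           "practical", "abstract", "practical", "practical"]
coder_b = ["abstract", "practical", "practical", "practical",
           "practical", "abstract", "practical", "abstract"]

kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for chance, which matters here because an uneven split between abstract and practical items would inflate simple percent agreement.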
Circularity Check
No circularity: benchmark and human survey are independent external references with no self-referential reduction.
Full rationale
The paper constructs the AllFaith benchmark from in-the-wild transcripts and faith-community contributors, applies an LLM-as-judge rubric that credits any religious mention, and compares outputs to a separate human-subjects survey. The central claim of underrepresentation and asymmetry is measured against these external baselines rather than derived from fitted parameters, self-citations, or definitions that presuppose the result. No equations, uniqueness theorems, or ansatzes appear; the derivation chain does not reduce to its own inputs by construction.