EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
Pith reviewed 2026-05-21 20:34 UTC · model grok-4.3
The pith
Large language models align more closely with Western moral survey responses than non-Western ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EvalMORAAL applies a chain-of-thought protocol and two scoring methods to twenty LLMs, evaluated against the World Values Survey (WVS) across fifty-five countries and the PEW survey across thirty-nine countries. Top models reach Pearson's r ≈ 0.90 with WVS responses. Western regions average r = 0.82 while non-Western regions average r = 0.61, a 0.21 absolute gap. A model-as-judge peer review also flags 348 conflicts, and peer agreement correlates with WVS alignment at r = 0.74.
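The arithmetic behind the headline numbers can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the regional figure is a plain mean of per-country Pearson correlations, and the names `pearson_r`, `regional_gap`, and the `western` set are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between model scores and survey scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def regional_gap(per_country_r, western):
    """Mean r over 'Western' countries minus mean r over all others.

    Assumption: regions are summarized by an unweighted mean of
    per-country correlations; the paper may aggregate differently.
    """
    west = [r for c, r in per_country_r.items() if c in western]
    rest = [r for c, r in per_country_r.items() if c not in western]
    return mean(west) - mean(rest)
```

Under this assumption, regional means of 0.82 and 0.61 give the reported 0.21 absolute gap.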
What carries the argument
EvalMORAAL, a chain-of-thought framework that combines log-probability scoring, direct rating scoring, self-consistency checks, and LLM-as-judge peer review to measure moral alignment against survey data.
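One way the two scoring routes could be put on a common scale is sketched below. The source does not give the formulas, so everything here is an assumption: the function names, the [-1, 1] range, and the use of two answer-token log-probabilities are illustrative choices only.

```python
from math import exp
from statistics import mean, pstdev


def logprob_score(lp_acceptable, lp_unacceptable):
    """Map two answer log-probabilities to a score in [-1, 1].

    Hypothetical scheme: renormalize the two probabilities so they sum
    to 1, then return P(acceptable) - P(unacceptable).
    """
    m = max(lp_acceptable, lp_unacceptable)  # subtract max for numerical stability
    p_acc = exp(lp_acceptable - m)
    p_unacc = exp(lp_unacceptable - m)
    return (p_acc - p_unacc) / (p_acc + p_unacc)


def self_consistent_rating(ratings):
    """Aggregate several sampled direct ratings (each assumed in [-1, 1]).

    Returns the mean rating and the population standard deviation; a
    large spread would signal low self-consistency across samples.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings), pstdev(ratings)
```

Because both functions return values on the same scale, scores from models that expose log-probabilities and models that only return text ratings could be compared directly, which is the purpose the review attributes to dual scoring.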
If this is right
- Dual scoring methods allow direct comparison across all models regardless of size or architecture.
- Structured CoT with self-consistency improves reliability of alignment measurements.
- Model-as-judge peer review can serve as an automated filter because its agreement tracks survey alignment.
- The observed regional gap indicates that current alignment techniques still fall short for global use.
Where Pith is reading between the lines
- Training corpora likely contain heavier representation of Western viewpoints, producing the measured difference.
- The same evaluation pipeline could be applied to other value domains such as fairness or safety to test for similar patterns.
- Targeted fine-tuning on non-Western survey responses might narrow the alignment gap without harming Western performance.
Load-bearing premise
World Values Survey and PEW responses serve as an unbiased, representative target for moral alignment that can be measured fairly across cultures.
What would settle it
Testing whether the 0.21 regional correlation gap closes or reverses when the moral questions are drawn primarily from non-Western sources, or when models are retrained on regionally balanced data.
Original abstract
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EvalMORAAL, a transparent chain-of-thought framework that applies two scoring methods (log-probabilities and direct ratings) plus an LLM-as-judge peer review to assess moral alignment of 20 large language models against the World Values Survey (55 countries, 19 topics) and PEW Global Attitudes Survey (39 countries, 8 topics). It reports that top models achieve high Pearson correlations with survey responses (r ≈ 0.90 on WVS) while identifying a regional gap (Western regions average r=0.82 versus non-Western regions average r=0.61).
Significance. If the regional gap is shown to be robust rather than an artifact of survey wording, response styles, or evaluation skew, the work would be significant for documenting challenges in culture-aware AI alignment. The structured CoT protocol with self-consistency and the model-as-judge component that flags 348 conflicts provide methodological transparency, though the paper supplies neither machine-checked proofs nor fully reproducible code.
major comments (2)
- [Abstract] The central claim of a persistent 0.21 regional alignment gap treats WVS and PEW responses as an unbiased, culture-neutral ground truth that both log-probability and direct-rating scores can measure comparably across regions. No evidence is supplied of language-matched prompts, local-expert validation, topic-stratified robustness checks, or controls for item wording and response-style biases, leaving open the possibility that the gap reflects measurement artifacts rather than model differences.
- [Abstract] The reported peer-agreement correlations (r=0.74 with WVS, r=0.39 with PEW) are presented separately from the primary alignment scores; it is therefore unclear whether the headline regional gap is derived independently or depends on the same external survey aggregates used to define alignment.
minor comments (2)
- [Abstract] The abstract mentions a 'data-driven threshold' for flagging conflicts but provides no formula, validation procedure, or sensitivity analysis for it.
- Clarify the exact criteria used to partition countries into 'Western' and 'non-Western' regions and report whether the 0.21 gap remains stable under alternative groupings or topic subsets.
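The stability check requested in the comments above could start from a resampling estimate of the gap's uncertainty. The sketch below uses a percentile bootstrap over countries; the function name, the resampling unit, and the default settings are assumptions, not something the paper reports.

```python
import random
from statistics import mean


def bootstrap_gap_ci(west_rs, other_rs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(west_rs) - mean(other_rs).

    west_rs / other_rs: per-country correlations for one grouping.
    Countries are resampled with replacement within each group.
    """
    if not west_rs or not other_rs:
        raise ValueError("both groups must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_resamples):
        west = rng.choices(west_rs, k=len(west_rs))
        other = rng.choices(other_rs, k=len(other_rs))
        gaps.append(mean(west) - mean(other))
    gaps.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return gaps[lo_idx], gaps[hi_idx]
```

Re-running the same function with alternative 'Western' and 'non-Western' partitions, or on topic subsets, would show whether the interval stays clear of zero under each grouping.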
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate where revisions have been made to the manuscript.
- Referee: [Abstract] The central claim of a persistent 0.21 regional alignment gap treats WVS and PEW responses as an unbiased, culture-neutral ground truth that both log-probability and direct-rating scores can measure comparably across regions. No evidence is supplied of language-matched prompts, local-expert validation, topic-stratified robustness checks, or controls for item wording and response-style biases, leaving open the possibility that the gap reflects measurement artifacts rather than model differences.
  Authors: We agree that the possibility of measurement artifacts is an important consideration and that stronger controls would be desirable. The current work applies a uniform English-language prompt template across all models and regions to maintain methodological consistency, given that the evaluated LLMs are predominantly English-centric. In the revised manuscript we have added an explicit Limitations subsection that acknowledges the lack of language-matched prompts and local-expert validation, and we have moved the topic-stratified robustness checks from the appendix into the main text to demonstrate that the regional gap is not driven by a small number of topics. We view full controls for response-style biases as a valuable direction for follow-up research rather than a requirement that can be fully addressed within the scope of this study.
  Revision: partial
- Referee: [Abstract] The reported peer-agreement correlations (r=0.74 with WVS, r=0.39 with PEW) are presented separately from the primary alignment scores; it is therefore unclear whether the headline regional gap is derived independently or depends on the same external survey aggregates used to define alignment.
  Authors: The regional gap is computed solely from the Pearson correlations between each model's EvalMORAAL scores (log-probability and direct-rating) and the corresponding regional survey aggregates; the peer-agreement analysis is performed afterward as an independent quality check on the LLM-as-judge component. We have revised the abstract and added a clarifying sentence in the Results section to make this separation explicit and to avoid any ambiguity about the source of the reported gap.
  Revision: yes
Circularity Check
No significant circularity; results rest on external survey benchmarks
The paper defines EvalMORAAL as an evaluation framework applying CoT, log-probability scoring, direct ratings, and model-as-judge to compare LLM outputs against independent external aggregates from the World Values Survey and PEW Global Attitudes Survey. Primary metrics (e.g., Pearson r ≈ 0.90 overall, regional r=0.82 vs 0.61) are computed directly from these comparisons rather than from any parameters fitted inside the paper or self-referential definitions. The peer-agreement correlation (r=0.74 on WVS) is presented separately as a quality-check validation and does not define or substitute for the alignment scores. No self-citations, uniqueness theorems, or ansatzes are shown to carry the central claims, and the derivation chain remains self-contained against the cited external data sources.
Axiom & Free-Parameter Ledger
free parameters (1)
- data-driven threshold for flagging conflicts
axioms (1)
- domain assumption: Responses in the World Values Survey and PEW Global Attitudes Survey constitute a valid, culture-spanning ground truth for moral alignment.
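Since the threshold is the ledger's only free parameter and the source gives no formula for it, the sketch below shows one hypothetical data-driven rule that would make a sensitivity analysis possible. It flags a pair of models as in conflict when their scores on an item differ by more than the mean pairwise difference plus `k` standard deviations; the rule, the name `flag_conflicts`, and the default `k` are all assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def flag_conflicts(scores, k=1.0):
    """Flag model pairs whose scores on one item disagree unusually strongly.

    scores: {model_name: score} for a single country-topic item.
    Threshold (hypothetical): mean + k * population std of all pairwise
    absolute differences. Returns (threshold, flagged_pairs).
    """
    pairs = list(combinations(sorted(scores), 2))
    if not pairs:
        raise ValueError("need at least two models")
    diffs = [abs(scores[a] - scores[b]) for a, b in pairs]
    threshold = mean(diffs) + k * pstdev(diffs)
    flagged = [pair for pair, d in zip(pairs, diffs) if d > threshold]
    return threshold, flagged
```

Sweeping `k` and counting the flagged pairs would show how strongly a conflict count such as the reported 348 depends on the chosen threshold.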
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Western regions average r=0.82 while non-Western regions average r=0.61"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- A Survey on LLM-as-a-Judge: reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
Appendix A (Complete Model Specifications): Table 2 provides complete specifications for all 20 evaluated models, including exact checkpoint identifiers, release dates, and parameter counts.