
arxiv: 2510.05942 · v3 · pith:BCH2ZHKInew · submitted 2025-10-07 · 💻 cs.CL · cs.AI

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Pith reviewed 2026-05-21 20:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords moral alignment · large language models · chain-of-thought · World Values Survey · cultural bias · LLM evaluation · peer review

The pith

Large language models align more closely with Western moral survey responses than non-Western ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EvalMORAAL evaluates moral alignment in twenty large language models by applying chain-of-thought reasoning and two scoring methods to questions from the World Values Survey (WVS) and the PEW Global Attitudes Survey. Top models reach a Pearson correlation of about 0.90 with WVS responses. The evaluation also reveals a 0.21 absolute gap between regions: Western regions average r = 0.82, while non-Western regions average r = 0.61. The framework adds structured self-consistency checks and a model-as-judge step that flags conflicting responses.

Core claim

EvalMORAAL applies a chain-of-thought protocol and two scoring methods to twenty LLMs on the World Values Survey (WVS; fifty-five countries) and the PEW survey (thirty-nine countries). Top models achieve a Pearson's r of approximately 0.90 with WVS responses. Western regions average r = 0.82 while non-Western regions average r = 0.61, a 0.21 absolute gap. The method also runs a model-as-judge peer review that flags 348 conflicts, and peer agreement correlates with WVS alignment at r = 0.74.
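As a minimal sketch of the alignment metric named above, the snippet below computes a Pearson correlation between survey scores and model scores and then reproduces the gap arithmetic from the reported regional averages. The `survey` and `model` values are made-up toy numbers, not data from the paper, and `pearson_r` is a hypothetical helper written for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy per-item scores (hypothetical, for illustration only).
survey = [-0.8, -0.2, 0.1, 0.6]
model = [-0.6, -0.3, 0.2, 0.7]
r = pearson_r(survey, model)

# Gap arithmetic using the regional averages reported in the text.
gap = round(0.82 - 0.61, 2)
print(f"toy r = {r:.3f}, reported regional gap = {gap}")
```

In the paper's setting the correlation would presumably be computed per model and per region before averaging; the exact aggregation order is not stated here and is left as an assumption.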

What carries the argument

EvalMORAAL, a chain-of-thought framework that combines log-probability scoring, direct rating scoring, self-consistency checks, and LLM-as-judge peer review to measure moral alignment against survey data.
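The two scoring routes can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulas: it assumes both routes map to a common [-1, 1] scale, that the log-probability route compares two answer options ("acceptable" vs. "unacceptable"), and that direct ratings use a 1 to 10 scale. The function names and the rating range are hypothetical.

```python
import math


def logprob_score(logp_acceptable, logp_unacceptable):
    """Map two option log-probabilities to a score in [-1, 1].

    Assumption: renormalise over the two options, then return
    P(acceptable) - P(unacceptable).
    """
    m = max(logp_acceptable, logp_unacceptable)  # subtract max for numerical stability
    p_acc = math.exp(logp_acceptable - m)
    p_unacc = math.exp(logp_unacceptable - m)
    return (p_acc - p_unacc) / (p_acc + p_unacc)


def direct_score(rating, low=1, high=10):
    """Linearly rescale a direct rating on [low, high] to [-1, 1] (assumed scale)."""
    if not low <= rating <= high:
        raise ValueError("rating outside the assumed scale")
    return 2 * (rating - low) / (high - low) - 1


print(logprob_score(math.log(0.75), math.log(0.25)))
print(direct_score(8))
```

Putting both routes on one scale is what would let log-probability and direct-rating scores be correlated against the same survey aggregates.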

If this is right

  • Dual scoring methods allow direct comparison across all models regardless of size or architecture.
  • Structured CoT with self-consistency improves reliability of alignment measurements.
  • Model-as-judge peer review can serve as an automated filter because its agreement tracks survey alignment.
  • The observed regional gap indicates that current alignment techniques still fall short for global use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training corpora likely contain heavier representation of Western viewpoints, producing the measured difference.
  • The same evaluation pipeline could be applied to other value domains such as fairness or safety to test for similar patterns.
  • Targeted fine-tuning on non-Western survey responses might narrow the alignment gap without harming Western performance.

Load-bearing premise

World Values Survey and PEW responses serve as an unbiased, representative target for moral alignment that can be measured fairly across cultures.

What would settle it

Testing whether a new set of moral questions drawn primarily from non-Western sources, or models retrained on regionally balanced data, closes or reverses the 0.21 regional correlation gap.

Figures

Figures reproduced from arXiv: 2510.05942 by Anastasia Giachanou, Hadi Mohammadi, Robert A. Bagheri.

Figure 1: EvalMORAAL Framework Overview.
Figure 2: Geographic alignment by tier. Cells show …
Figure 4: Distribution of score differences with conflict …
Figure 3: Peer-agreement vs. survey alignment. Each point is one model; the x-axis is Pearson r_DIR computed from direct CoT scores. Models are colored by performance tiers defined on WVS r_DIR. Within-tier OLS lines with 95% CIs are shown for visualization; given small Top-tier n, bands are descriptive.
Figure 5: Mean absolute error by topic, aggregated …
Figure 6: Absolute error distributions by performance …
Figure 8: Mid …
Figure 9: Lower-tier models (r < 0.75) such as Claude-3-Haiku and o1-mini display weaker correlations and broader spread, indicating reduced moral coherence.
Figure 7: Top-tier models (r ≥ 0.85) such as Claude-3-Opus, GPT-4o, and Gemini-Pro show near-perfect alignment, clustering tightly around the regression line. [Scatter panels: Survey Score (x-axis) vs. Model Score (y-axis); visible panel titles include GPT-4 (=0.847) and GPT-4o-mini (=0.837).]
Figure 11: Mean absolute error by topic for all 20 models
Figure 12: Error distributions for representative individ…

Original abstract

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
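The abstract does not give the formula behind the data-driven threshold, so the following is only a hypothetical illustration of what such a rule could look like: a pair of scores for the same item is flagged as a conflict when their absolute difference exceeds the mean plus `k` standard deviations of all absolute differences. Both the rule and the default `k` are assumptions, and the toy pairs are invented.

```python
from statistics import mean, pstdev


def conflict_threshold(diffs, k=1.0):
    """Hypothetical data-driven threshold: mean + k * std of absolute differences."""
    abs_diffs = [abs(d) for d in diffs]
    return mean(abs_diffs) + k * pstdev(abs_diffs)


def flag_conflicts(score_pairs, k=1.0):
    """Return indices of score pairs whose absolute difference exceeds the threshold."""
    diffs = [a - b for a, b in score_pairs]
    threshold = conflict_threshold(diffs, k)
    return [i for i, d in enumerate(diffs) if abs(d) > threshold]


# Toy score pairs on a [-1, 1] scale (hypothetical).
pairs = [(0.1, 0.1), (0.2, 0.1), (0.0, 0.1), (0.9, -0.9)]
print(flag_conflicts(pairs))
```

Because the threshold is derived from the same differences it is applied to, the number of flagged items depends on `k`; a sensitivity sweep over `k` would show how stable a count such as 348 is.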

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EvalMORAAL, a transparent chain-of-thought framework that applies two scoring methods (log-probabilities and direct ratings) plus an LLM-as-judge peer review to assess moral alignment of 20 large language models against the World Values Survey (55 countries, 19 topics) and PEW Global Attitudes Survey (39 countries, 8 topics). It reports that top models achieve high Pearson correlations with survey responses (r ≈ 0.90 on WVS) while identifying a regional gap (Western regions average r=0.82 versus non-Western regions average r=0.61).

Significance. If the regional gap is shown to be robust rather than an artifact of survey wording, response styles, or evaluation skew, the work would be significant for documenting challenges in culture-aware AI alignment. The structured CoT protocol with self-consistency and the model-as-judge component that flags 348 conflicts provide methodological transparency, though the paper supplies neither machine-checked proofs nor fully reproducible code.

major comments (2)
  1. [Abstract] The central claim of a persistent 0.21 regional alignment gap treats WVS and PEW responses as an unbiased, culture-neutral ground truth that both log-probability and direct-rating scores can measure comparably across regions. No evidence is supplied of language-matched prompts, local-expert validation, topic-stratified robustness checks, or controls for item wording and response-style biases, leaving open the possibility that the gap reflects measurement artifacts rather than model differences.
  2. [Abstract] The reported peer-agreement correlations (r=0.74 with WVS, r=0.39 with PEW) are presented separately from the primary alignment scores; it is therefore unclear whether the headline regional gap is derived independently or depends on the same external survey aggregates used to define alignment.
minor comments (2)
  1. [Abstract] The abstract states a 'data-driven threshold' for flagging conflicts but provides no formula, validation procedure, or sensitivity analysis for this threshold.
  2. Clarify the exact criteria used to partition countries into 'Western' and 'non-Western' regions and report whether the 0.21 gap remains stable under alternative groupings or topic subsets.
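The stability check requested in the second minor comment could be sketched as follows: recompute the Western minus non-Western gap under several candidate definitions of "Western" and compare the results. The country labels, correlation values, and groupings below are placeholders, not the paper's data or its actual partition.

```python
from statistics import mean


def regional_gap(r_by_country, western):
    """Mean r over countries labelled Western minus mean r over the rest."""
    west = [r for c, r in r_by_country.items() if c in western]
    rest = [r for c, r in r_by_country.items() if c not in western]
    if not west or not rest:
        raise ValueError("both groups must be non-empty")
    return mean(west) - mean(rest)


def gap_under_groupings(r_by_country, groupings):
    """Recompute the gap for each named candidate grouping."""
    return {name: regional_gap(r_by_country, members)
            for name, members in groupings.items()}


# Placeholder per-country correlations and two hypothetical groupings.
r_by_country = {"A": 0.8, "B": 0.9, "C": 0.6, "D": 0.5}
groupings = {"narrow": {"A", "B"}, "broad": {"A", "B", "C"}}
gaps = gap_under_groupings(r_by_country, groupings)
for name, value in gaps.items():
    print(f"{name}: {value:.3f}")
```

A gap that keeps its sign and rough size across groupings would support the headline claim; a gap that shrinks sharply under a plausible alternative partition would point to sensitivity in the regional definition.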

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim of a persistent 0.21 regional alignment gap treats WVS and PEW responses as an unbiased, culture-neutral ground truth that both log-probability and direct-rating scores can measure comparably across regions. No evidence is supplied of language-matched prompts, local-expert validation, topic-stratified robustness checks, or controls for item wording and response-style biases, leaving open the possibility that the gap reflects measurement artifacts rather than model differences.

    Authors: We agree that the possibility of measurement artifacts is an important consideration and that stronger controls would be desirable. The current work applies a uniform English-language prompt template across all models and regions to maintain methodological consistency, given that the evaluated LLMs are predominantly English-centric. In the revised manuscript we have added an explicit Limitations subsection that acknowledges the lack of language-matched prompts and local-expert validation, and we have moved the topic-stratified robustness checks from the appendix into the main text to demonstrate that the regional gap is not driven by a small number of topics. We view full controls for response-style biases as a valuable direction for follow-up research rather than a requirement that can be fully addressed within the scope of this study.

    revision: partial

  2. Referee: [Abstract] The reported peer-agreement correlations (r=0.74 with WVS, r=0.39 with PEW) are presented separately from the primary alignment scores; it is therefore unclear whether the headline regional gap is derived independently or depends on the same external survey aggregates used to define alignment.

    Authors: The regional gap is computed solely from the Pearson correlations between each model’s EvalMORAAL scores (log-probability and direct-rating) and the corresponding regional survey aggregates; the peer-agreement analysis is performed afterward as an independent quality check on the LLM-as-judge component. We have revised the abstract and added a clarifying sentence in the Results section to make this separation explicit and to avoid any ambiguity about the source of the reported gap.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on external survey benchmarks

full rationale

The paper defines EvalMORAAL as an evaluation framework applying CoT, log-probability scoring, direct ratings, and model-as-judge to compare LLM outputs against independent external aggregates from the World Values Survey and PEW Global Attitudes Survey. Primary metrics (e.g., Pearson r ≈ 0.90 overall, regional r=0.82 vs 0.61) are computed directly from these comparisons rather than from any parameters fitted inside the paper or self-referential definitions. The peer-agreement correlation (r=0.74 on WVS) is presented separately as a quality-check validation and does not define or substitute for the alignment scores. No self-citations, uniqueness theorems, or ansatzes are shown to carry the central claims, and the derivation chain remains self-contained against the cited external data sources.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The evaluation protocol rests on one main domain assumption about survey validity and introduces a single data-driven threshold; no new physical entities or heavy parameter fitting are described.

free parameters (1)
  • data-driven threshold for flagging conflicts
    Used to identify the 348 conflicts in the model-as-judge peer review step.
axioms (1)
  • domain assumption: Responses in the World Values Survey and PEW Global Attitudes Survey constitute a valid, culture-spanning ground truth for moral alignment.
    This premise underpins both the overall correlation claims and the Western/non-Western gap calculation.

pith-pipeline@v0.9.0 · 5786 in / 1436 out tokens · 59154 ms · 2026-05-21T20:34:54.358616+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Survey on LLM-as-a-Judge

    cs.CL 2024-11 unverdicted novelty 4.0

    A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
