Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 04:03 UTC · model grok-4.3
The pith
Disagreement among World Values Survey personas steers black-box LLMs toward country-specific cultural preferences at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that instantiating each country as a panel of persona agents grounded in World Values Survey responses, then converting their disagreement into a bounded, loss-averse logit correction, reduces cultural misalignment on the MultiTP benchmark by 10 to 24 percent on the six backbones from 3.8B to 70B parameters (of seven tested, the smallest being 2B), and by 2 to 7 percent in open-ended scenarios, all without modifying model weights.
What carries the argument
DISCA, a disagreement-informed steering mechanism that instantiates countries via multiple World-Values-Survey-grounded persona agents and applies their disagreement as a loss-averse logit adjustment at inference.
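A minimal sketch of what such a correction could look like, under stated assumptions: persona disagreement is summarized as the panel's mean deviation from the base logits, passed through a Kahneman–Tversky-style loss-averse value function, and clipped to a bound. Every name here (including `delta_max`) is illustrative; this is not the paper's exact formula.

```python
import numpy as np

def kt_value(z, alpha=0.88, kappa=2.25):
    # Loss-averse value function: concave for gains, steeper for losses.
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, np.abs(z) ** alpha, -kappa * np.abs(z) ** alpha)

def disca_like_correction(base_logits, persona_logits, delta_max=2.0):
    """Hypothetical bounded, loss-averse logit correction.

    base_logits:    shape (V,), logits from the unsteered model
    persona_logits: shape (P, V), logits from P persona-conditioned runs
    """
    # The panel's mean deviation from the base model carries the disagreement signal.
    deviation = persona_logits.mean(axis=0) - base_logits
    # Loss aversion: penalized options are pushed down more strongly than
    # favored options are pulled up.
    correction = kt_value(deviation)
    # Clipping bounds the correction so steering cannot swamp the base distribution.
    return base_logits + np.clip(correction, -delta_max, delta_max)
```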
If this is right
- Black-box LLMs can be aligned to diverse cultures using only public survey data and inference-time computation.
- The method scales across model sizes from 3.8 billion to 70 billion parameters without retraining.
- Within-country disagreement serves as a more effective steering signal than seeking cultural consensus.
- Open-ended generation scenarios show smaller but positive gains from the same correction.
- Alignment becomes feasible for the long tail of global moral preferences without per-country fine-tuning budgets.
Where Pith is reading between the lines
- Providers of API-based models could deploy this as a default layer to improve cultural sensitivity for users in different regions.
- The approach might extend to other forms of value alignment, such as political or ethical preferences, by sourcing appropriate disagreement data.
- Future work could test whether the same personas improve performance on related tasks like cross-cultural translation or bias detection.
- Since it requires no weight changes, it could be combined with other inference techniques like chain-of-thought without interference.
Load-bearing premise
That the disagreement among sociodemographic personas derived from the World Values Survey captures the primary and sufficient signal needed to correct a model's cultural misalignment.
What would settle it
Measuring cultural misalignment on the MultiTP benchmark after applying the DISCA logit correction. The premise fails if scores do not decrease relative to the uncorrected baseline, or decrease no more than under a control that uses random personas.
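A minimal sketch of that decision rule, assuming a hypothetical scalar misalignment metric (mean absolute gap between the model's and the survey's preference vectors); the function names are illustrative, not the paper's:

```python
import numpy as np

def misalignment(model_prefs, country_prefs):
    # Illustrative metric: mean absolute gap between the model's per-dilemma
    # preference probabilities and the country-level survey preferences.
    return float(np.mean(np.abs(model_prefs - country_prefs)))

def premise_survives(baseline, disca_corrected, random_corrected, country_prefs):
    """The settling test sketched above: the premise is refuted if DISCA fails
    to reduce misalignment, or reduces it no more than random personas do."""
    m_base = misalignment(baseline, country_prefs)
    m_disca = misalignment(disca_corrected, country_prefs)
    m_rand = misalignment(random_corrected, country_prefs)
    return m_disca < m_base and m_disca < m_rand
```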
Original abstract
Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DISCA, an inference-time method for cultural alignment of LLMs. It instantiates each country as a panel of World Values Survey-grounded persona agents whose disagreement is converted into a bounded, loss-averse logit correction. Evaluated across 20 countries and 7 open-weight backbones (2B-70B), it claims 10-24% reduction in cultural misalignment on MultiTP for models >=3.8B and 2-7% on open-ended scenarios, without any weight changes, positioning it as a scalable alternative to fine-tuning in black-box, public-data-only settings.
Significance. If the results hold, the work would be significant for showing that public survey data and disagreement-driven inference-time steering can mitigate cultural misalignment without training, offering a practical approach for the long tail of global preferences. The grounding in external WVS data and the evaluation on open-weight models are strengths, though applicability beyond logit-accessible models remains untested.
Major comments (3)
- Abstract: The paper frames DISCA as a solution for the 'realistic black-box, public-data-only regime' because prior methods require fine-tuning or white-box access. However, the method 'converts their disagreement into a bounded, loss-averse logit correction,' which presupposes direct access to output token probabilities. Experiments are reported only on 7 open-weight backbones; no prompt-only approximation or transfer to closed APIs (e.g., GPT-4) is tested. This makes the central black-box claim unsupported.
- Experiments: The abstract states quantitative improvements of 10-24% on MultiTP and 2-7% on open-ended scenarios, but provides no details on experimental controls, baseline comparisons, statistical significance, variance across runs, or how the logit correction bound is enforced. This prevents assessment of whether the data support the claims.
- Method: The claim that within-country sociodemographic disagreement (via WVS-grounded personas) is the primary steering signal is load-bearing but lacks ablations. No comparison is shown against consensus-based personas, random personas, or alternative disagreement measures to confirm sufficiency.
Minor comments (2)
- Abstract: Clarify performance on the 2B model, as improvements are reported only for the six backbones >=3.8B.
- Overall: Define MultiTP at first mention and provide a brief description of the benchmark and the open-ended evaluation protocol.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important clarifications needed regarding the scope of our method and the robustness of our experimental claims. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee (Abstract): The paper frames DISCA as a solution for the 'realistic black-box, public-data-only regime' because prior methods require fine-tuning or white-box access. However, the method 'converts their disagreement into a bounded, loss-averse logit correction,' which presupposes direct access to output token probabilities. Experiments are reported only on 7 open-weight backbones; no prompt-only approximation or transfer to closed APIs (e.g., GPT-4) is tested. This makes the central black-box claim unsupported.
  Authors: We agree that the logit correction requires direct access to output probabilities, which is available for open-weight models but not for closed APIs. In the manuscript, 'black-box' denotes the absence of weight updates or private training data, in contrast to fine-tuning approaches. To resolve the ambiguity, we will revise the abstract, introduction, and related sections to limit the claim explicitly to open-weight models with logit access and to note that proprietary APIs remain untested. No prompt-only approximation was developed, as the core mechanism depends on logit adjustments. Revision: partial.
- Referee (Experiments): The abstract states quantitative improvements of 10-24% on MultiTP and 2-7% on open-ended scenarios, but provides no details on experimental controls, baseline comparisons, statistical significance, variance across runs, or how the logit correction bound is enforced. This prevents assessment of whether the data support the claims.
  Authors: We will expand the Experiments section with a dedicated subsection on setup and controls. This will include baseline comparisons (standard prompting, consensus personas, and prior alignment techniques), statistical significance tests (e.g., paired tests across countries), variance as standard deviations over repeated runs, and the precise enforcement of the loss-averse logit bound via the correction formula and hyperparameters. These details will substantiate the reported improvements. Revision: yes.
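A minimal sketch of the paired analysis proposed here, with placeholder per-country scores (the actual numbers are not reproduced in this review):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline = rng.uniform(0.3, 0.6, size=20)        # placeholder misalignment, 20 countries
disca = baseline * rng.uniform(0.76, 0.90, 20)   # placeholder 10-24% relative reductions

# Each country contributes one (baseline, DISCA) pair, so paired tests apply.
t_stat, p_t = stats.ttest_rel(baseline, disca)
w_stat, p_w = stats.wilcoxon(baseline, disca)
print(f"paired t-test: t={t_stat:.2f}, p={p_t:.3g}; Wilcoxon signed-rank: p={p_w:.3g}")
```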
- Referee (Method): The claim that within-country sociodemographic disagreement (via WVS-grounded personas) is the primary steering signal is load-bearing but lacks ablations. No comparison is shown against consensus-based personas, random personas, or alternative disagreement measures to confirm sufficiency.
  Authors: We will add an ablation subsection comparing the disagreement-based approach to consensus-based personas, random personas, and alternative measures such as opinion variance or entropy. The results will show that sociodemographic disagreement yields stronger alignment gains, validating its role as the primary signal. This will be supported by quantitative tables and discussion of the cultural nuance captured by disagreement. Revision: yes.
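For illustration, two of the candidate disagreement measures named above, sketched over a hypothetical persona panel:

```python
import numpy as np

def choice_entropy(persona_choices, n_options):
    # Shannon entropy of the panel's choice distribution: zero at consensus,
    # maximal when the panel splits evenly across options.
    counts = np.bincount(np.asarray(persona_choices), minlength=n_options)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def score_variance(persona_scores):
    # Variance of per-persona preference scores, the other measure mentioned.
    return float(np.var(persona_scores))

# Example: a split panel disagrees more than a near-consensus one.
print(choice_entropy([0, 1, 0, 1, 1], 2))  # ~0.673
print(choice_entropy([1, 1, 1, 1, 0], 2))  # ~0.500
```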
Circularity Check
No circularity: derivation grounded in external WVS data and explicit inference-time procedure
Full rationale
The paper's central method (DISCA) instantiates personas from the external World Values Survey, computes disagreement among them, and applies a defined logit correction at inference time. No quoted equations, self-citations, or steps reduce the claimed misalignment reduction to a fitted parameter renamed as prediction, a self-definitional loop, or an ansatz imported only via the authors' prior work. The empirical results on open-weight models are presented as direct measurements rather than forced by construction from the inputs. The black-box framing mismatch noted in external commentary concerns applicability assumptions, not a circular derivation chain within the paper's own logic.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: Proposition 1 (Variance-aware shrinkage...): γ⋆ = Δ_h² / (Δ_h² + τ²/N), monotone decreasing in τ².
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: loss-averse importance sampling... Kahneman–Tversky value function v(z) = z^α for z ≥ 0, v(z) = −κ(−z)^α for z < 0 (both quoted formulas are sketched below).
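For concreteness, a minimal sketch of the two quoted formulas; the symbols follow the quoted passages and the numeric values are illustrative:

```python
import numpy as np

def shrinkage_weight(delta_h, tau, n):
    # Proposition 1 as quoted: gamma* = Dh^2 / (Dh^2 + tau^2/N).
    # Monotone decreasing in tau^2: noisier persona panels are shrunk harder.
    return delta_h**2 / (delta_h**2 + tau**2 / n)

def kt_value(z, alpha=0.88, kappa=2.25):
    # Kahneman-Tversky value function as quoted: z^a for gains,
    # -kappa*(-z)^a for losses; kappa > 1 makes losses loom larger.
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, np.abs(z) ** alpha, -kappa * np.abs(z) ** alpha)

print(shrinkage_weight(0.5, tau=0.2, n=10))  # ~0.984: low-noise panel, little shrinkage
print(shrinkage_weight(0.5, tau=1.0, n=10))  # ~0.714: noisier panel, more shrinkage
print(kt_value([0.5, -0.5]))                 # symmetric input, asymmetric (loss-averse) output
```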
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.