arxiv: 2605.23825 · v1 · pith:UN7ZWAKNnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

Pith reviewed 2026-05-25 04:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords geopolitical bias · LLMs · post-training · pre-training · AI alignment · prompt language · model evaluation · national preferences

The pith

Geopolitical bias in LLMs is introduced during post-training rather than inherited from pre-training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests where geopolitical bias comes from by comparing base models that received only pre-training with their post-trained chat counterparts from seven AI labs. Using a forced-choice probe on country pairs in English, French, and Chinese, the authors find that bias favoring the developer's home country or region appears or strengthens after post-training in six of the seven cases. The largest example is Qwen 2.5, which shifts from neutral to strongly pro-China. The size of the bias also varies with the language of the prompt: Mistral, for example, leans pro-France only when prompted in French. These results indicate that choices made during alignment and fine-tuning actively shape national preferences, rather than those preferences being inherited from pre-training data alone.

Core claim

The authors claim that geopolitical bias in LLMs originates in post-training rather than in pre-training. Across seven base/chat model pairs, tested with a paired-scenario forced-choice probe over 28 country pairs in three languages, the models from six of the seven labs shifted toward the developer's country or region after post-training. The largest shift is Qwen 2.5's, reported as an 18x change in odds. Bias magnitude also depends on prompt language: Mistral becomes pro-France only under French prompting.

What carries the argument

The paired-scenario forced-choice probe, which measures bias by requiring the model to choose between scenarios linked to different countries.
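
To make the mechanics concrete, here is a minimal sketch of what a probe of this kind could look like. It is not the authors' implementation: the country list, the prompt wording, and the helper names `build_prompts` and `favourability` are all hypothetical. The only fact taken from the paper is the count of 28 country pairs, which happens to equal C(8, 2), so the sketch assumes eight countries. Presenting each pair in both orders is a common guard against position bias, and it is an assumption here as well.

```python
import itertools
import math
from collections import Counter

# Hypothetical country list: 8 countries give C(8, 2) = 28 unordered pairs.
COUNTRIES = ["China", "France", "USA", "Germany", "India", "Brazil", "Nigeria", "Japan"]
PAIRS = list(itertools.combinations(COUNTRIES, 2))

def build_prompts(scenario: str, a: str, b: str) -> list[tuple[str, dict[str, str]]]:
    """Return both orderings of a forced-choice prompt, each with its option-to-country map."""
    prompts = []
    for first, second in ((a, b), (b, a)):
        text = f"{scenario}\n(A) {first}\n(B) {second}\nAnswer with A or B."
        prompts.append((text, {"A": first, "B": second}))
    return prompts

def favourability(choices: list[str], target: str) -> float:
    """Log-odds that `target` is chosen, with a 0.5 continuity correction on each count."""
    counts = Counter(choices)
    wins = counts[target]
    losses = len(choices) - wins
    return math.log((wins + 0.5) / (losses + 0.5))
```

A favourability of 0 means the model picks the target country as often as its opponent; positive values mean the target is favoured.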

If this is right

  • Post-training alignment processes actively introduce or amplify national preferences in model outputs.
  • Transparency and auditing of post-training data and methods are needed to track these effects.
  • The language of user prompts can increase or decrease the expression of specific country biases.
  • Similar bias shifts appear across models from multiple countries, pointing to a shared pattern in post-training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams performing post-training may embed their own regional perspectives through data selection or reward modeling.
  • Models deployed in different languages could exhibit different geopolitical leanings depending on prompt language.
  • Targeted changes to post-training datasets might reduce or balance the observed home-country shifts.
  • Regulators or users could require disclosure of post-training procedures to assess national bias risks.

Load-bearing premise

That the base models contain only pre-training effects with no post-training influences and that the forced-choice probe measures geopolitical bias without interference from model size, architecture, or prompt wording.

What would settle it

A set of additional base-chat pairs in which post-training produces no consistent increase in home-country favoritism would undermine the claim that the bias originates in post-training.
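
One rough way to quantify "consistent" is a sign test over model pairs. The sketch below is an editorial illustration, not an analysis the paper reports. It treats each lab as an independent coin flip under the null hypothesis that post-training is equally likely to shift toward or away from the home country, which is a strong simplifying assumption. The function name `sign_test_p` is hypothetical.

```python
from math import comb

def sign_test_p(successes: int, n: int) -> float:
    """One-sided binomial tail P(X >= successes) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

# Six of seven labs shifting in the home-country direction.
print(sign_test_p(6, 7))  # 0.0625
```

With only seven pairs, the direction count alone is weak evidence, so the per-model effect sizes carry most of the weight and additional pairs would be informative either way.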

Figures

Figures reproduced from arXiv: 2605.23825 by Brinnae Bent, Stuart Bladon.

Figure 1: Overview, seven families. (A) Per-country preference base → post-trained; for the six non-GLM bases, cross-country spread σ grows post-training (Qwen 3.9 → 30.3 pp). (B) Post-training ∆ in China-favourability (EN, coherent subset). 3/3 Western labs shift anti-China; 3/4 Chinese labs shift pro-China; Yi shifts anti-China after prefill correction. GLM is shown with its (atypical) base preserved for completen…
Figure 2: Inference-time language modulates the post-training bias. China favourability (A) and France favourability (B) for all 14 models under English, French, and Chinese prompts. Blue cells = the model favours the target country; red cells = the model disfavours it; zero-centred scale. The Chinese column is uniformly bluer for post-trained models (except the already-saturated Qwen); the Mistral-inst cell under t…
Figure 3: China-vs-X favourability decomposed by opponent X, seven post-trained models. Orange: China favoured; blue: China disfavoured vs. Western opponent; grey: vs. non-Western. Four Chinese signatures: Qwen strongly anti-Western; Baichuan pro-China only vs. Global South; Yi pro-China only vs. USA; GLM broad mild anti-China. Western models are anti-China most strongly against the Global-South countries they other…
Figure 4: Three robustness ablations on Qwen 2.5 inst and Mistral 7B inst. (A) Removing the neutral hedging prefix shifts measured China-favourability by ≤ 0.3 log-odds for both models. (B) Three alternate MCQ wordings preserve sign in 8/8 model×phrasing combinations. (C) Cross-prompting factorial: for Mistral the scenario language does nearly all the work (ZH scenario / EN question → −0.03, close to ZH/ZH −0.07); f…
Original abstract

It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, p=0.15), the post-trained chat variant is at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
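
The log-odds figures in the abstract can be converted to odds and probabilities with a few lines of arithmetic. The sketch below uses only the two Qwen 2.5 values quoted above; the helper names are hypothetical. Note that exp(2.91) is about 18.4, which matches the quoted "18x" if the chat model is compared against even odds (log-odds 0), while the base-to-chat difference of 3.06 log-odds corresponds to an odds ratio of about 21.3. The abstract does not say which baseline the 18x figure uses, so this reading is an inference.

```python
import math

def odds_ratio(log_odds_before: float, log_odds_after: float) -> float:
    """Multiplicative change in odds implied by a shift in log-odds."""
    return math.exp(log_odds_after - log_odds_before)

def probability(log_odds: float) -> float:
    """Logistic transform: log-odds to probability of choosing the target country."""
    return 1.0 / (1.0 + math.exp(-log_odds))

base, chat = -0.15, 2.91  # Qwen 2.5 China-favourability, as quoted in the abstract

print(round(math.exp(chat), 1))          # 18.4  (chat odds relative to even odds)
print(round(odds_ratio(base, chat), 1))  # 21.3  (chat odds relative to base odds)
print(round(probability(base), 3))       # 0.463
print(round(probability(chat), 3))       # 0.948
```

In probability terms, the base model picks China about 46% of the time and the chat model about 95% of the time on this probe.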

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that geopolitical bias in LLMs originates in post-training rather than pre-training. This is based on testing seven open-weight base/chat model pairs from different labs using a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese. Six of seven labs show post-training shifts toward the developer's country/region; the largest is Qwen (base -0.15 log-odds on China-favourability to chat +2.91). Prompt language modulates the effect (e.g., Mistral pro-France only in French).

Significance. If the attribution to post-training holds after verification of base-model purity and full methodological controls, the result would shift understanding of LLM bias from pre-training data inheritance to active shaping during alignment. This has clear implications for transparency requirements around post-training and for auditing geopolitical preferences in deployed models. The multi-lab, multi-language design provides a useful comparative framework.

major comments (2)
  1. [Abstract] Abstract: The claim that geopolitical bias 'originates in post-training rather than in pre-training' requires that each base model contains zero post-training effects. The manuscript states the bases are 'pre-training only' but provides no verification, release-note analysis, or discussion of possible undisclosed steps (continued pre-training, data filtering, or early safety) that could affect even one or two pairs and thereby undermine the within-pair shift attribution (e.g., the reported Qwen change from -0.15 to +2.91).
  2. [Methods] Methods section (implied by abstract description): The paired-scenario forced-choice probe is presented at summary level only. Without the exact prompt templates, the full set of 28 country-pair scenarios, controls for model-size or architecture confounds, and the precise procedure for computing log-odds and p-values, it is not possible to confirm that the probe cleanly isolates post-training effects from prompt sensitivity or other factors.
minor comments (1)
  1. The abstract reports specific numerical shifts and p-values; the main text should include the full statistical reporting, sample sizes per condition, and any multiple-comparison corrections to support reproducibility.
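
As an illustration of the multiple-comparison point, the following sketch applies a Holm step-down correction to a list of p-values. The manuscript summary does not say which correction, if any, the authors use, so Holm is an assumed choice, and the example p-values are made up for demonstration.

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values for three model/language conditions.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With many model, language, and country-pair combinations, reporting adjusted values alongside raw ones would show which shifts survive correction.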

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate where revisions will be made.

  1. Referee: [Abstract] Abstract: The claim that geopolitical bias 'originates in post-training rather than in pre-training' requires that each base model contains zero post-training effects. The manuscript states the bases are 'pre-training only' but provides no verification, release-note analysis, or discussion of possible undisclosed steps (continued pre-training, data filtering, or early safety) that could affect even one or two pairs and thereby undermine the within-pair shift attribution (e.g., the reported Qwen change from -0.15 to +2.91).

    Authors: We agree that stronger attribution requires acknowledging the limits of our assumption. The manuscript relies on the public release documentation from each lab, which designates the base models as pre-training only. We did not conduct independent verification (e.g., release-note forensics or probing for undisclosed filtering). We will add an explicit Limitations paragraph stating this reliance and noting that the within-pair shifts are interpreted under the standard base-versus-chat distinction used in the field. revision: yes

  2. Referee: [Methods] Methods section (implied by abstract description): The paired-scenario forced-choice probe is presented at summary level only. Without the exact prompt templates, the full set of 28 country-pair scenarios, controls for model-size or architecture confounds, and the precise procedure for computing log-odds and p-values, it is not possible to confirm that the probe cleanly isolates post-training effects from prompt sensitivity or other factors.

    Authors: The full manuscript contains a Methods section that expands on the abstract, but we accept that greater detail is warranted for reproducibility. We will expand the Methods section (and add an appendix) with the exact prompt templates, the complete list of 28 country-pair scenarios, the step-by-step computation of log-odds and p-values, and an explicit discussion of controls. The paired base/chat design inherently controls for architecture and size confounds because each comparison holds the underlying model fixed; we will state this clearly. revision: yes
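
For readers who want a concrete picture of a within-pair comparison, the sketch below computes a chat-minus-base log-odds shift from raw choice counts, with a Wald z-score and a two-sided normal p-value. This is a standard log odds ratio test, not the procedure the authors describe, which is not available in the summary. The function name and the example counts are hypothetical.

```python
import math

def log_odds_shift(base_wins: int, base_losses: int,
                   chat_wins: int, chat_losses: int) -> tuple[float, float, float]:
    """Return (shift, z, p): chat-minus-base log-odds, Wald z-score, two-sided p-value."""
    # Haldane-Anscombe correction: add 0.5 to every cell to avoid division by zero.
    a, b = chat_wins + 0.5, chat_losses + 0.5
    c, d = base_wins + 0.5, base_losses + 0.5
    shift = math.log(a / b) - math.log(c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = shift / se
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return shift, z, p

# Hypothetical counts: base picks the target 50/100 times, chat picks it 90/100 times.
shift, z, p = log_odds_shift(50, 50, 90, 10)
```

Because both counts come from the same model family, architecture and size are held fixed, which is the control the authors refer to. Whether the base checkpoint is free of post-training remains a separate assumption.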

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of base vs. chat variants

The paper reports an empirical measurement: geopolitical bias scores on a forced-choice probe shift after post-training across six of seven model pairs. No equations, fitted parameters, or derivations are present that could reduce the result to its inputs by construction. The central attribution (bias originates in post-training) follows from the observed within-pair differences under the stated assumption that base models contain only pre-training; this assumption is declared rather than derived, and the probe results are not statistically forced by any self-referential definition or self-citation chain. No self-definitional, fitted-input, or ansatz-smuggling patterns appear. The study is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the forced-choice probe as a measure of geopolitical bias and the assumption that differences between base and chat models are attributable only to post-training.

axioms (1)
  • Domain assumption: Standard statistical significance testing (p-values) accurately reflects genuine differences in model preference.
    Used to report shifts such as p<10^-4.
