Language Bias under Conflicting Information in Multilingual LLMs

Murathan Kurfalı; Robert Östling

arxiv: 2604.07123 · v1 · submitted 2026-04-08 · 💻 cs.CL

Pith reviewed 2026-05-10 18:23 UTC · model grok-4.3

keywords: language bias · multilingual LLMs · conflicting information · needles in haystack · GPT models · Russian · Chinese · context length

The pith

Multilingual LLMs mostly ignore conflicts in provided facts and assert only one answer, with a consistent bias against Russian and, at the longest context lengths, toward Chinese.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the language in which conflicting facts are written affects how LLMs resolve them, by placing opposing pieces of news information in five languages inside long contexts. Across many models and sizes, including GPT-5.2, the systems largely discard the conflict and confidently output just one of the two answers. The choice of which answer is kept shows a stable language pattern: Russian sources are disfavored, while Chinese sources gain preference at the longest context lengths. These patterns appear in models trained both inside and outside mainland China, though they are somewhat stronger in models trained inside mainland China.

Core claim

When conflicting factual claims appear in different languages within the same context, all evaluated LLMs, including GPT-5.2, ignore the contradiction in the large majority of cases and instead confidently assert only one of the two possible answers. Across models there is a consistent language preference: Russian is generally under-selected, and Chinese is over-selected at the longest context lengths. The same directional biases hold for models trained both inside and outside mainland China, though they are somewhat stronger in the former group.

What carries the argument

Multilingual extension of the conflicting needles-in-a-haystack test, using pairs of opposing naturalistic news statements placed in five languages inside a single long prompt.
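
As a rough illustration of this setup, the sketch below places two conflicting statements, written in different languages, at random positions inside filler text and appends a question. The function name `build_haystack_prompt`, the example needles, and the prompt layout are all assumptions made for illustration; the paper's actual templates and insertion procedure are not described in the material reviewed here.

```python
import random

# Invented example needles: the same event with conflicting years, in Russian and Chinese.
NEEDLE_RU = "Мост был открыт в 2019 году."   # "The bridge was opened in 2019."
NEEDLE_ZH = "这座桥于2021年开通。"              # "This bridge opened in 2021."

def build_haystack_prompt(filler_paragraphs, needle_a, needle_b, question, seed=0):
    """Insert two conflicting needles at random positions among filler paragraphs."""
    rng = random.Random(seed)  # seeded so the same prompt can be rebuilt
    docs = list(filler_paragraphs)
    # Pick two distinct slots in the original filler sequence.
    pos_a, pos_b = rng.sample(range(len(docs) + 1), 2)
    # Insert at the larger index first so the smaller index stays valid.
    for pos, needle in sorted([(pos_a, needle_a), (pos_b, needle_b)], reverse=True):
        docs.insert(pos, needle)
    return "\n\n".join(docs) + "\n\nQuestion: " + question
```

Varying the number of filler paragraphs is one simple way to vary context length while keeping the two needles fixed.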

If this is right

LLMs cannot be trusted to synthesize conflicting information fairly when the sources are in different languages.
The observed bias against Russian and toward Chinese at maximum lengths is reproducible across independent model families.
Training origin (inside versus outside mainland China) modulates but does not eliminate the language preference pattern.
Longer contexts do not remove the language-based selection effect: the preference for Chinese appears at the longest context lengths.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the bias survives when the factual content is held constant and only language is swapped, it points to surface-level token or script preferences rather than deeper semantic evaluation.
Global applications that mix news or reports from many languages may systematically under-weight information from Russian sources.
Adding explicit conflict-resolution instructions or balanced language sampling during fine-tuning could be tested as a mitigation.

Load-bearing premise

That differences in model outputs are driven mainly by the language in which each conflicting fact appears rather than by the specific content of the facts, imbalances in training data, or details of how the prompt is formatted.

What would settle it

Repeat the experiment after swapping the languages of the two conflicting facts while keeping their content identical; if the preference for Chinese and against Russian disappears or reverses, the language-bias claim is falsified.
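
A minimal sketch of such a swap test, assuming a counterbalanced design in which every pair of conflicting claims is presented twice with the languages exchanged. The helper names `counterbalanced_trials` and `language_win_rates` and the data layout are hypothetical, not taken from the paper.

```python
from collections import Counter

def counterbalanced_trials(fact_pairs, lang_a, lang_b):
    """For each (claim_1, claim_2) pair, emit both language assignments."""
    trials = []
    for claim_1, claim_2 in fact_pairs:
        trials.append({claim_1: lang_a, claim_2: lang_b})
        trials.append({claim_1: lang_b, claim_2: lang_a})  # languages swapped
    return trials

def language_win_rates(trials, chosen_claims):
    """Share of trials in which the model's chosen claim was written in each language."""
    wins = Counter(trial[chosen] for trial, chosen in zip(trials, chosen_claims))
    total = sum(wins.values())
    return {lang: count / total for lang, count in wins.items()}
```

Under this design a model that always prefers one claim regardless of language yields equal win rates for both languages, so any remaining imbalance can be attributed to language rather than content.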

Figures

Figures reproduced from arXiv: 2604.07123 by Murathan Kurfalı, Robert Östling.

Original abstract

Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends conflicting-needles to multilingual news and reports consistent language preferences in how models resolve conflicts, but the evidence for genuine linguistic bias over content confounds is still weak.

The main takeaway is that multilingual LLMs largely ignore conflicts in the input and default to one answer, with a recurring tilt against Russian-language needles and toward Chinese ones at longer contexts. This holds across models trained inside and outside China. That pattern is the paper's core observation and the reason it might interest people working on cross-lingual factuality or retrieval-augmented systems.

The extension itself is straightforward: they take the existing needles setup, swap in naturalistic news items across five languages, and run the same models. That move is useful because prior work stayed mostly monolingual or used synthetic facts. Testing GPT-5.2 alongside smaller models and noting the consistency is also a plus; it shows the behavior is not limited to one training regime. The naturalistic domain helps too, since real news stories make the conflicts more representative than made-up facts.

The soft spots sit mainly in the attribution step. The abstract and setup do not describe how the five-language news items were checked for equivalent content, framing, or salience. Naturalistic data makes perfect matching hard, so differences in phrasing, cultural salience, or training-data frequency could produce the observed preferences without any special language bias. No sample sizes, statistical tests, or prompt templates appear in the provided summary, which leaves the strength of the claims hard to judge. The stress-test concern about translation or content artifacts therefore lands; without back-translation checks or equivalence metrics, the language-bias interpretation stays provisional.

Readers working on multilingual evaluation or bias mitigation would get the most from this, as a prompt for tighter controls rather than a finished result. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee who can ask for the missing methodological details and re-run checks. I would send it to review with the expectation of revisions on the confound controls, not desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript extends the conflicting needles in a haystack paradigm to a multilingual setting with naturalistic news-domain data in five languages. It evaluates a range of multilingual LLMs and reports that models largely ignore conflicts, confidently asserting only one answer; it further identifies consistent language preferences (bias against Russian, toward Chinese at longest contexts) that hold across models trained inside and outside mainland China.

Significance. If the language preferences survive controls for content equivalence, the work would usefully document a systematic limitation in how current multilingual LLMs integrate conflicting information across languages, with direct relevance to multilingual QA and retrieval applications. The reported consistency across model origins is a positive feature that strengthens the empirical claim.

major comments (1)

[Abstract and experimental setup] The central claim attributes output preferences to the language of the conflicting information rather than to content-specific factors. Because the stimuli are naturalistic news items, this attribution requires that the five-language versions are matched for semantic content, factual framing, length, and salience. The abstract and setup description do not report back-translation checks, parallel-corpus alignment, or post-hoc content-equivalence metrics, leaving open the possibility that the observed Russian/Chinese patterns are artifacts of the particular stories chosen.

minor comments (2)

The abstract provides no sample sizes, number of trials per condition, or statistical tests supporting the statements that models 'ignore the conflict ... in the large majority of cases' and exhibit 'consistent bias.'
Prompt templates, exact context lengths, and the procedure for inserting the conflicting needles are not described, which limits reproducibility and makes it hard to judge whether formatting details could drive the language effects.
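
To illustrate the kind of statistical support the first minor comment asks for, here is a sketch of an exact two-sided binomial test of whether one language is selected at a rate different from chance. The function name and the example counts in the usage note are invented for illustration; the paper's actual trial counts and tests are not reported in the material reviewed here.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    under the null hypothesis than the observed outcome.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Small relative tolerance so floating-point ties count as "equally likely".
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)
```

For example, `binomial_two_sided_p(30, 100)` would test a hypothetical case in which the Russian-language answer was chosen in 30 of 100 trials against a null rate of 0.5.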

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about content equivalence is well-taken and we will strengthen the manuscript to address it directly.

Referee: [Abstract and experimental setup] The central claim attributes output preferences to the language of the conflicting information rather than to content-specific factors. Because the stimuli are naturalistic news items, this attribution requires that the five-language versions are matched for semantic content, factual framing, length, and salience. The abstract and setup description do not report back-translation checks, parallel-corpus alignment, or post-hoc content-equivalence metrics, leaving open the possibility that the observed Russian/Chinese patterns are artifacts of the particular stories chosen.

Authors: We agree that rigorous verification of semantic, factual, and length equivalence is necessary to support the attribution of preferences to language rather than content. The stimuli were created by selecting English news articles and producing parallel versions in the other four languages via professional translation, with explicit instructions to preserve facts, framing, and approximate length. We acknowledge that the current manuscript does not report back-translation checks, alignment metrics, or post-hoc equivalence statistics. In the revision we will (1) expand the data-construction subsection to detail the translation protocol and any manual review steps, (2) add quantitative equivalence checks (e.g., LaBSE embedding cosine similarity, token-length histograms, and key-fact overlap), and (3) report any residual differences and their likely impact on the results. These additions will directly test whether the Russian/Chinese patterns survive content controls.

Revision: yes
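
The equivalence checks named in the rebuttal could look roughly like the sketch below. It assumes sentence embeddings (for instance from a multilingual encoder such as LaBSE) have already been computed and are passed in as plain lists of floats, and that key facts such as dates, numbers, and names have been extracted into language-independent strings. All three helper names are hypothetical.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def length_ratio(tokens_src, tokens_tgt):
    """Token-count ratio of a translation to its (non-empty) source."""
    return len(tokens_tgt) / len(tokens_src)

def key_fact_overlap(facts_src, facts_tgt):
    """Jaccard overlap between two sets of extracted key facts."""
    src, tgt = set(facts_src), set(facts_tgt)
    union = src | tgt
    return len(src & tgt) / len(union) if union else 1.0
```

Low similarity, a skewed length ratio, or missing key facts for one language would each flag a stimulus pair whose content, not its language, might explain a model's preference.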

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of LLM outputs

The paper reports direct empirical observations from evaluating multiple LLMs on a multilingual extension of the conflicting needles task using naturalistic news data. No equations, derivations, fitted parameters, or predictions are present; results consist of measured output frequencies and language preferences across models. The methodology relies on external benchmarks (model evaluations) rather than any self-referential construction, self-citation chain, or renaming of inputs as outputs. The central claims about bias patterns are falsifiable observations, not quantities defined by the authors' own choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is observational and introduces no free parameters, invented entities, or non-standard axioms beyond routine assumptions about how model outputs can be interpreted as conflict resolution or language preference.

axioms (1)

Domain assumption: Model outputs can be categorized as ignoring or resolving conflicts based on surface text, without deeper semantic analysis.
The reported findings depend on interpreting whether the model acknowledges the conflict or asserts one answer.
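
To make this assumption concrete, the sketch below shows one possible surface-text categorization based on simple substring matching. The function name and category labels are hypothetical and may differ from the paper's actual scoring procedure, which is not described in the material reviewed here.

```python
def categorize_response(response, answer_a, answer_b):
    """Label a model response by which candidate answers appear in its text."""
    text = response.casefold()  # case-insensitive matching
    has_a = answer_a.casefold() in text
    has_b = answer_b.casefold() in text
    if has_a and has_b:
        return "both_mentioned"  # may indicate the conflict was noticed
    if has_a:
        return "asserts_a"
    if has_b:
        return "asserts_b"
    return "neither"
```

The weakness of the assumption is visible here: a response that mentions both answers has not necessarily acknowledged the conflict, and a paraphrased answer would be missed entirely.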

