Language Bias under Conflicting Information in Multilingual LLMs
Pith reviewed 2026-05-10 18:23 UTC · model grok-4.3
The pith
Multilingual LLMs ignore conflicts in provided facts and assert only one answer, with a consistent bias against Russian and toward Chinese at long contexts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When conflicting factual claims appear in different languages within the same context, all evaluated LLMs, including GPT-5.2, ignore the contradiction in the large majority of cases and instead confidently assert only one of the two possible answers. Across models there is a consistent language preference: Russian is systematically under-selected, and Chinese is over-selected at the longest context lengths. The same directional biases hold for models trained both inside and outside mainland China, though they are somewhat stronger in the former group.
What carries the argument
Multilingual extension of the conflicting needles-in-a-haystack test, using pairs of opposing naturalistic news statements placed in five languages inside a single long prompt.
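The construction can be pictured with a minimal sketch. It is not the authors' code: the function name `build_conflict_prompt`, the prompt layout, and the example sentences are all hypothetical, and the paper's actual templates and insertion procedure are not described in the material reviewed here.

```python
import random


def build_conflict_prompt(filler, needle_a, needle_b, question, seed=0):
    """Place two conflicting statements at random slots among filler paragraphs.

    Hypothetical helper for illustration only; the paper's real template
    and insertion procedure are not given in the reviewed text.
    """
    rng = random.Random(seed)
    paragraphs = list(filler)  # copy so the caller's list is not mutated
    # Two distinct slots, so neither needle is systematically first.
    pos_a, pos_b = rng.sample(range(len(paragraphs) + 1), 2)
    # Insert at the later slot first so the earlier index stays valid.
    for pos, text in sorted([(pos_a, needle_a), (pos_b, needle_b)], reverse=True):
        paragraphs.insert(pos, text)
    context = "\n\n".join(paragraphs)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


# Made-up example: the same event reported with opposing facts in two languages.
filler = [f"Unrelated news paragraph {i}." for i in range(5)]
prompt = build_conflict_prompt(
    filler,
    "The summit was held in Oslo.",   # English needle (invented)
    "Саммит прошёл в Женеве.",        # Russian needle (invented)
    "Where was the summit held?",
)
```

Randomizing both positions matters because a fixed order would confound language preference with position preference.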
If this is right
- LLMs cannot be trusted to synthesize conflicting information fairly when the sources are in different languages.
- The observed bias against Russian and toward Chinese at maximum lengths is reproducible across independent model families.
- Training origin (inside versus outside mainland China) modulates but does not eliminate the language preference pattern.
- Longer contexts amplify rather than reduce the language-based selection effect.
Where Pith is reading between the lines
- If the bias survives when the factual content is held constant and only language is swapped, it points to surface-level token or script preferences rather than deeper semantic evaluation.
- Global applications that mix news or reports from many languages may systematically under-weight information from Russian sources.
- Adding explicit conflict-resolution instructions or balanced language sampling during fine-tuning could be tested as a mitigation.
Load-bearing premise
That differences in model outputs are driven mainly by the language in which each conflicting fact appears rather than by the specific content of the facts, imbalances in training data, or details of how the prompt is formatted.
What would settle it
Repeat the experiment after swapping the languages of the two conflicting facts while keeping their content identical; if the preference for Chinese and against Russian disappears or reverses, the language-bias claim is falsified.
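One way to read off the outcome of such a swap is sketched below, under the assumption that each trial is logged as a hypothetical tuple `(lang_of_fact_a, lang_of_fact_b, chosen)`. This record format and the function `language_win_rates` are illustrative and do not come from the paper.

```python
from collections import Counter


def language_win_rates(trials):
    """Fraction of appearances in which each language's fact was the one asserted.

    trials: iterable of (lang_of_fact_a, lang_of_fact_b, chosen),
    where chosen is "a" or "b". Hypothetical record format.
    """
    wins, appearances = Counter(), Counter()
    for lang_a, lang_b, chosen in trials:
        appearances[lang_a] += 1
        appearances[lang_b] += 1
        wins[lang_a if chosen == "a" else lang_b] += 1
    return {lang: wins[lang] / appearances[lang] for lang in appearances}


# Same two facts, languages swapped between the two trials.
content_driven = [("ru", "zh", "a"), ("zh", "ru", "a")]   # model always picks fact A
language_driven = [("ru", "zh", "b"), ("zh", "ru", "a")]  # model always picks the zh fact

print(language_win_rates(content_driven))
print(language_win_rates(language_driven))
```

If the choice tracks content, the swap pushes every language toward a win rate of 0.5; if it tracks language, the rates stay skewed after the swap.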
Original abstract
Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the conflicting needles in a haystack paradigm to a multilingual setting with naturalistic news-domain data in five languages. It evaluates a range of multilingual LLMs and reports that models largely ignore conflicts, confidently asserting only one answer; it further identifies consistent language preferences (bias against Russian, toward Chinese at longest contexts) that hold across models trained inside and outside mainland China.
Significance. If the language preferences survive controls for content equivalence, the work would usefully document a systematic limitation in how current multilingual LLMs integrate conflicting information across languages, with direct relevance to multilingual QA and retrieval applications. The reported consistency across model origins is a positive feature that strengthens the empirical claim.
major comments (1)
- [Abstract and experimental setup] The central claim attributes output preferences to the language of the conflicting information rather than to content-specific factors. Because the stimuli are naturalistic news items, this attribution requires that the five-language versions are matched for semantic content, factual framing, length, and salience. The abstract and setup description do not report back-translation checks, parallel-corpus alignment, or post-hoc content-equivalence metrics, leaving open the possibility that the observed Russian/Chinese patterns are artifacts of the particular stories chosen.
minor comments (2)
- The abstract provides no sample sizes, number of trials per condition, or statistical tests supporting the statements that models 'ignore the conflict ... in the large majority of cases' and exhibit 'consistent bias.'
- Prompt templates, exact context lengths, and the procedure for inserting the conflicting needles are not described, which limits reproducibility and makes it hard to judge whether formatting details could drive the language effects.
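As an illustration of the kind of test the first minor comment asks for, here is a sketch of an exact two-sided binomial test against the null hypothesis that neither language is preferred (p = 0.5). The paper does not say which test, if any, was used; the function name is hypothetical and no counts from the paper are implied.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under Binomial(n, p).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical usage: one language's fact chosen k times out of n paired trials.
p_value = binomial_two_sided_p(k=0, n=10)
```

With many models and language pairs, a correction for multiple comparisons would also be needed before calling a bias "consistent".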
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The concern about content equivalence is well-taken and we will strengthen the manuscript to address it directly.
Point-by-point responses
- Referee: [Abstract and experimental setup] The central claim attributes output preferences to the language of the conflicting information rather than to content-specific factors. Because the stimuli are naturalistic news items, this attribution requires that the five-language versions are matched for semantic content, factual framing, length, and salience. The abstract and setup description do not report back-translation checks, parallel-corpus alignment, or post-hoc content-equivalence metrics, leaving open the possibility that the observed Russian/Chinese patterns are artifacts of the particular stories chosen.
Authors: We agree that rigorous verification of semantic, factual, and length equivalence is necessary to support the attribution of preferences to language rather than content. The stimuli were created by selecting English news articles and producing parallel versions in the other four languages via professional translation, with explicit instructions to preserve facts, framing, and approximate length. We acknowledge that the current manuscript does not report back-translation checks, alignment metrics, or post-hoc equivalence statistics. In the revision we will (1) expand the data-construction subsection to detail the translation protocol and any manual review steps, (2) add quantitative equivalence checks (e.g., LaBSE embedding cosine similarity, token-length histograms, and key-fact overlap), and (3) report any residual differences and their likely impact on the results. These additions will directly test whether the Russian/Chinese patterns survive content controls.
Revision: yes
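Two of the promised equivalence checks can be sketched in plain Python. This assumes sentence embeddings have already been computed elsewhere (for example by a LaBSE model) and are passed in as lists of floats, and that key facts have been extracted into sets of normalized strings. Both helper names are hypothetical and the thresholds for "equivalent" are left open.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def key_fact_overlap(facts_a, facts_b):
    """Jaccard overlap between two sets of normalized key facts."""
    a, b = set(facts_a), set(facts_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy vectors and invented facts, for illustration only.
sim = cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05])
overlap = key_fact_overlap({"2024", "Oslo"}, {"2024", "Geneva"})
```

Length matching is harder to sketch generically, because token counts depend on the tokenizer of the model under test and differ sharply between, say, Chinese and Russian text.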
Circularity Check
No circularity: purely empirical evaluation of LLM outputs
The paper reports direct empirical observations from evaluating multiple LLMs on a multilingual extension of the conflicting needles task using naturalistic news data. No equations, derivations, fitted parameters, or predictions are present; results consist of measured output frequencies and language preferences across models. The methodology relies on external benchmarks (model evaluations) rather than any self-referential construction, self-citation chain, or renaming of inputs as outputs. The central claims about bias patterns are falsifiable observations, not quantities defined by the authors' own choices.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Model outputs can be categorized as ignoring or resolving conflicts based on surface text without deeper semantic analysis
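A surface-text categorization of this kind might look like the sketch below. It is an assumed illustration, not the paper's scoring procedure: the function, the category labels, and the alias lists are hypothetical, and the substring matching is crude on purpose to show what the assumption takes for granted.

```python
def categorize_response(response, aliases_a, aliases_b):
    """Label a response by which candidate answers it mentions.

    aliases_a / aliases_b: surface forms of each answer, possibly in
    several languages. Pure substring matching, no semantic analysis.
    """
    text = response.casefold()
    has_a = any(alias.casefold() in text for alias in aliases_a)
    has_b = any(alias.casefold() in text for alias in aliases_b)
    if has_a and has_b:
        return "mentions_both"
    if has_a:
        return "asserts_a"
    if has_b:
        return "asserts_b"
    return "neither"


# Invented example answers with English and Russian surface forms.
label = categorize_response(
    "The summit was held in Oslo.",
    aliases_a=["Oslo", "Осло"],
    aliases_b=["Geneva", "Женева"],
)
```

A response that names both answers only to dismiss one would be labeled `mentions_both` here, which is exactly the sort of case where surface matching and semantic judgment can diverge.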