arxiv: 2605.30481 · v1 · pith:RSC6LSRUnew · submitted 2026-05-28 · 💻 cs.CL

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

Pith reviewed 2026-06-29 07:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords global narrative dominance · large language models · cultural knowledge · Bangla · cross-lingual consistency · epistemic perspective · institutional bias · local knowledge

The pith

Questions in English lead LLMs to substitute global and institutional narratives for local Bangla cultural perspectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how large language models respond to culturally specific questions about Bangla traditions and knowledge. It creates a dataset of 717 parallel Bangla-English instances with annotations for sociocultural context and evidence. Tests on nine models show that English prompts reliably increase the replacement of local details with global ones and heighten institutional framing. Adding local evidence raises factual accuracy and some perspective balance, yet the language-driven shift persists. The work frames these outcomes as problems of narrative prioritization rather than simple knowledge gaps.

Core claim

English-language prompts about Bengali cultural topics cause LLMs to increase global substitution and institutional bias while decreasing coverage of local epistemic perspectives. This pattern appears in both question-only and evidence-based settings across multiple models, and local evidence improves consistency without removing the language-induced change in narrative framing.

What carries the argument

The CulturalNB dataset of parallel Bangla-English question-answer pairs with evidence, metadata, and annotations, evaluated through metrics of global substitution, institutional bias, and epistemic perspective coverage scored by human and LLM judges.
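As a rough illustration of how judge labels might be turned into comparable scores, the sketch below assumes each response has already been split into atomic units labelled `global`, `local`, or `neutral`. The function name `perspective_shares` and the simple share-based scoring are assumptions made for illustration; they are not the paper's metric definitions.

```python
from collections import Counter

LABELS = ("global", "local", "neutral")

def perspective_shares(unit_labels):
    """Return the fraction of units carrying each label (hypothetical scoring)."""
    counts = Counter(unit_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(unit_labels)
    if total == 0:
        # No units extracted: report zero share for every label.
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}

# Made-up labels for one question answered in each language.
bangla = perspective_shares(["local", "local", "global", "neutral"])
english = perspective_shares(["global", "global", "local", "neutral"])

# Positive value: the English response covers less local perspective.
local_drop = bangla["local"] - english["local"]
```

Comparing shares within the same question, rather than across questions, keeps topic difficulty from being mistaken for a language effect.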

If this is right

  • Cultural errors in LLMs include failures of grounding and narrative choice in addition to missing facts.
  • Local evidence raises factual consistency and some perspective coverage but leaves language-based epistemic shifts intact.
  • Cross-lingual use of LLMs for low-resource cultural topics can systematically favor dominant global frames.
  • Prompt language choice affects which knowledge sources LLMs prioritize even when supporting evidence is supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same language-induced substitution may occur in other low-resource languages when English prompts are used.
  • Applications such as cultural education tools or local knowledge archives could inherit these framing biases if English is the default interface.
  • Testing whether the effect scales with model size or training data volume would clarify whether it is a general property of current LLMs.
  • Designers could explore language-anchored prompting or separate local-knowledge modules to reduce the observed shifts.

Load-bearing premise

Ratings of global substitution, institutional bias, and local perspective coverage by human and LLM judges measure narrative dominance without being skewed by training data imbalances or prompt wording.

What would settle it

A follow-up evaluation on a comparable set of cultural questions that finds no measurable rise in global substitution or drop in local perspective coverage when switching from Bangla to English prompts would falsify the claim.
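One way such a follow-up could check for a "measurable rise" is a paired sign-flip permutation test on per-question score differences. This is a minimal sketch under stated assumptions: the function name, the per-question numeric scores, and the choice of test are illustrative and are not taken from the paper.

```python
import random

def paired_sign_flip_test(bangla_scores, english_scores, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    if len(bangla_scores) != len(english_scores) or not bangla_scores:
        raise ValueError("scores must be non-empty and paired by question")
    # Positive difference: the English prompt scored higher on this question.
    diffs = [e - b for b, e in zip(bangla_scores, english_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value
```

A mean difference near zero with a large p-value on a comparable question set would be the kind of result that counts against the claim; a single small sample would not be enough to settle it either way.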

Figures

Figures reproduced from arXiv: 2605.30481 by Farhan Samir, Md Arid Hasan, Ruwad Naswan, Sharifa Sultana, Syed Ishtiaque Ahmed.

Figure 1: Example from CulturalNB illustrating global narrative dominance and evaluation dimensions. The figure shows a culturally grounded question, a translated English question, and the responses (both Bangla and English) generated by GPT-5.4. Highlighted responses represent narrative dominance.
Figure 2: Source distribution of CulturalNB dataset. Through this manual collection process, we reviewed eight books on Bengali culture, along with a wide range of news articles, Wikipedia pages, and regional archival sources, and compiled a dataset of 717 culturally grounded instances. Each instance consists of a culturally grounded question–answer pair, a supporting passage or transcription, domain, source type, …
Figure 3: Cross-Lingual Factual Consistency (higher is …
Figure 4: Language Anchor Bias (lower is better) across …
Figure 5: Epistemic Perspective Coverage (higher is …
Original abstract

Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as global narrative dominance in Bangla, a low-resource cultural context. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla-English question-answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla-English question-answer pairs, supporting evidence, and sociocultural annotations. It evaluates nine state-of-the-art LLMs via question-only and evidence-based prompting, using human and two LLM judges to measure cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. The central claim is that English questions systematically increase global substitution and institutional framing while reducing local perspective coverage, and that local evidence improves consistency but does not eliminate language-induced epistemic shifts.

Significance. If the judge-based measurements are shown to be reliable, the work would provide concrete evidence that query language affects narrative prioritization in LLMs for low-resource cultural contexts, beyond simple knowledge gaps. The new CulturalNB dataset with parallel pairs and annotations represents a useful resource for future multilingual cultural evaluation studies.

major comments (2)
  1. [Abstract and Evaluation] The central claim that English questions 'systematically increase global substitution and institutional framing' rests entirely on human and LLM judge ratings of interpretive categories on the 717 instances, yet no inter-rater agreement figures, rubric examples, calibration procedures, or tests for judge bias versus model output are reported. This directly affects the validity of the language-induced shift results.
  2. [Evaluation Methodology] The abstract states results from 'human and two independent LLM judges' across multiple metrics but provides no details on statistical tests, sample balancing across cultural instances, or controls for confounds such as prompt artifacts or training-data imbalances in the judges themselves. These omissions are load-bearing for interpreting the reported directional effects.
minor comments (2)
  1. [Abstract] The abstract mentions 'nine state-of-the-art LLMs' but does not list model names or versions; adding this in the methods would improve reproducibility.
  2. [Dataset Construction] Clarify whether the parallel Bangla-English pairs were created by native speakers and how sociocultural annotations were validated.
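The inter-rater agreement figure requested in the first major comment could be computed along the following lines. This is a hedged sketch: Cohen's kappa is one common choice for two raters with categorical labels, and the function name and example labels below are assumptions rather than anything reported in the paper.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and paired by item")
    n = len(ratings_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Made-up labels: a human annotator versus one LLM judge.
human = ["global", "global", "global", "local"]
judge = ["global", "global", "local", "local"]
kappa_example = cohens_kappa(human, judge)
```

Reporting kappa per metric, and separately for human versus judge and judge versus judge, would show whether any one rating dimension is driving disagreement.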

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the reporting of our evaluation procedures.

  1. Referee: [Abstract and Evaluation] The central claim that English questions 'systematically increase global substitution and institutional framing' rests entirely on human and LLM judge ratings of interpretive categories on the 717 instances, yet no inter-rater agreement figures, rubric examples, calibration procedures, or tests for judge bias versus model output are reported. This directly affects the validity of the language-induced shift results.

    Authors: We agree that inter-rater reliability metrics and rubric details are essential for validating the judge-based results. The original manuscript omitted these elements. In the revision we will add Cohen's kappa (or equivalent) between the human annotator and each LLM judge, pairwise agreement between the two LLM judges, sample rubric excerpts with calibration examples, and any available checks for systematic judge bias relative to model outputs. These will appear in a new subsection of the Evaluation Methodology and an expanded appendix. revision: yes

  2. Referee: [Evaluation Methodology] The abstract states results from 'human and two independent LLM judges' across multiple metrics but provides no details on statistical tests, sample balancing across cultural instances, or controls for confounds such as prompt artifacts or training-data imbalances in the judges themselves. These omissions are load-bearing for interpreting the reported directional effects.

    Authors: We concur that explicit statistical procedures, sampling details, and confound controls are required. The revision will specify the statistical tests applied to directional effects (e.g., McNemar or paired Wilcoxon tests with exact p-values and effect sizes), confirm that the 717 instances were stratified by sociocultural category and region, and describe prompt-artifact controls (fixed templates, order randomization). For judge confounds we will report the distinct model families used for the two LLM judges and any post-hoc checks for training-data overlap. These additions will be integrated into the Evaluation Methodology section. revision: yes
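The McNemar test named in the second response can be sketched as below for paired binary outcomes. The sketch assumes each question yields a yes/no flag per prompt language (for example, whether the response was judged to contain global substitution); that flag and the function name are assumptions for illustration only.

```python
from math import comb

def mcnemar_exact(bangla_flags, english_flags):
    """Exact two-sided McNemar test on paired binary outcomes."""
    if len(bangla_flags) != len(english_flags):
        raise ValueError("flags must be paired by question")
    pairs = list(zip(bangla_flags, english_flags))
    # Only discordant pairs carry information about a language effect.
    only_english = sum(1 for b, e in pairs if e and not b)
    only_bangla = sum(1 for b, e in pairs if b and not e)
    n = only_english + only_bangla
    if n == 0:
        return only_english, only_bangla, 1.0
    k = min(only_english, only_bangla)
    # Binomial tail with success probability 0.5, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_english, only_bangla, min(1.0, 2 * tail)
```

The exact form is used here because it stays valid when few pairs are discordant; with many discordant pairs a chi-squared approximation would give similar answers.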

Circularity Check

0 steps flagged

Empirical study with new dataset and annotations; no equations, fits, or self-citation chains reduce claims to inputs.

The paper introduces the CulturalNB dataset of 717 curated instances with parallel questions and annotations, then evaluates nine LLMs using human and LLM judges on metrics including global substitution, institutional bias, and epistemic perspective coverage. No mathematical derivations, fitted parameters, or predictions appear in the provided text. Results are reported directly from the new evaluations rather than reduced to prior self-citations or ansatzes. The central claim rests on the introduced data and rating procedures, which are independent of any self-referential loop. This is a standard empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the validity of the new dataset curation and the judgment metrics for narrative dominance; these are introduced without external benchmarks in the abstract.

axioms (1)
  • [domain assumption] Human and LLM judges can reliably quantify 'global substitution' and 'epistemic perspective coverage'
    Evaluation metrics rest on these judgments; abstract does not report validation or agreement statistics.

pith-pipeline@v0.9.1-grok · 5722 in / 1083 out tokens · 27765 ms · 2026-06-29T07:40:36.492507+00:00 · methodology
