All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG
Pith reviewed 2026-05-09 23:48 UTC · model grok-4.3
The pith
Multilingual RAG rerankers favor English and the query language, creating a gap that LAURA closes by aligning scores to generative utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current mRAG systems suffer from language bias during reranking that favors English and the query's native language, as shown by a substantial gap to the oracle upper bound and by systematic suppression of answer-critical documents scattered across multiple languages. LAURA bridges this gap by aligning multilingual evidence ranking directly with downstream generative utility.
What carries the argument
LAURA, a reranker that aligns multilingual evidence scores with the utility of that evidence for the downstream generation model rather than with language identity.
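A minimal sketch of the utility-alignment idea, assuming a leave-one-in utility proxy and a pairwise hinge objective; both are our illustrative choices, not confirmed details of LAURA's training objective, and `answer_score` is a hypothetical wrapper around generation plus evaluation:

```python
def utility_labels(query, docs, answer_score):
    """Per-document utility: downstream answer quality when the generator
    sees only that document (leave-one-in proxy; illustrative, not the
    paper's exact objective). `answer_score(query, docs) -> float` wraps
    generation and scoring."""
    return [answer_score(query, [d]) for d in docs]

def pairwise_alignment_loss(scores, utilities, margin=1.0):
    """Hinge loss pushing reranker scores to respect the utility ordering.
    Note that document language never enters the objective."""
    loss = 0.0
    for si, ui in zip(scores, utilities):
        for sj, uj in zip(scores, utilities):
            if ui > uj:  # doc i is more useful to the generator than doc j
                loss += max(0.0, margin - (si - sj))
    return loss
```

Training the reranker to minimize such a loss ties its ordering to generator usefulness rather than to language identity.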
If this is right
- Rerankers select evidence from a broader set of languages when it improves generation quality.
- The gap between observed mRAG performance and the oracle bound shrinks.
- Improvements appear consistently across tested languages and generation models without retraining the generator.
- Answer-critical documents that were previously suppressed become available to the generator.
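The first consequence is directly auditable: the language distribution of top-k selections. A small sketch, using a hypothetical data layout (ranked lists of `(doc_id, language)` pairs, one list per query) of the histogram such an audit would compare against the oracle's selections:

```python
from collections import Counter

def language_share(selections, k=5):
    """Fraction of top-k slots occupied by each language across queries.
    A biased reranker concentrates mass on English and the query language;
    the gap to the oracle's distribution is the bias being measured."""
    counts = Counter(lang for ranked in selections
                     for _, lang in ranked[:k])
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# Illustrative data, not from the paper.
reranked = [[("d1", "en"), ("d2", "en"), ("d3", "de")],
            [("d4", "en"), ("d5", "fr"), ("d6", "en")]]
share = language_share(reranked, k=3)
```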
Where Pith is reading between the lines
- The same utility-alignment idea could be applied to other retrieval biases such as domain or cultural preferences.
- Feedback from generation quality may become a standard signal for training cross-lingual retrievers.
- Low-resource languages may show larger gains if the bias is stronger there.
Load-bearing premise
That the estimated oracle evidence analysis yields a reliable upper bound, and that realigning reranker scores to generative utility will not introduce new unintended biases or performance drops in specific languages.
What would settle it
Running the full LAURA pipeline on a new collection of languages and generation models. Finding either no reduction in language bias or no gain in end-to-end mRAG accuracy compared with the original reranker would undermine the claim.
read the original abstract
Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such "answer-critical" documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multilingual RAG systems exhibit language bias in reranking that favors English and the query's native language. It quantifies a substantial gap to an estimated oracle upper bound via oracle evidence analysis, identifies a distributional mismatch in which systems suppress answer-critical documents spread across languages, and proposes LAURA (Language-Agnostic Utility-driven Reranker Alignment) to align reranker scores with downstream generative utility. Experiments across languages and models reportedly show bias mitigation and consistent mRAG gains.
Significance. If the oracle bound is tight and LAURA's gains hold without new language-specific degradations, the work would provide a practical method to reduce language bias in mRAG, improving equity and performance for non-English and low-resource languages in retrieval-augmented generation.
major comments (2)
- [Abstract / Oracle Evidence Analysis] The estimated oracle evidence analysis (described in the abstract as quantifying the performance gap) is load-bearing for the headline claim of a 'substantial performance gap'; its construction must be specified in detail (e.g., how optimal cross-lingual evidence is selected and whether it assumes idealized fusion) to confirm the bound is not inflated by unrealistic assumptions.
- [Experiments] The claim that LAURA 'consistently improves mRAG performance' across diverse languages rests on the assumption that utility-driven alignment does not create compensating drops in low-resource languages; experiments must report per-language breakdowns, statistical tests, and controls to rule out new biases introduced by the generative-utility proxy.
minor comments (2)
- [Abstract] The abstract introduces the LAURA acronym and method but provides no concrete details on baselines, reranker models, or generation models used; adding a brief experimental summary would improve readability.
- [Introduction / Analysis] Ensure all terms such as 'answer-critical' documents and 'distributional mismatch' are defined on first use with a reference to the relevant analysis section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.
read point-by-point responses
-
Referee: [Abstract / Oracle Evidence Analysis] The estimated oracle evidence analysis (described in the abstract as quantifying the performance gap) is load-bearing for the headline claim of a 'substantial performance gap'; its construction must be specified in detail (e.g., how optimal cross-lingual evidence is selected and whether it assumes idealized fusion) to confirm the bound is not inflated by unrealistic assumptions.
Authors: We agree that additional detail on the oracle construction is warranted to support the performance gap claim. In the revised manuscript, we will expand the description in Section 3.2 (and add a clarifying paragraph in the abstract if space permits) to specify the exact procedure: the oracle enumerates feasible combinations of documents from the multilingual retrieval pool, selects the subset that maximizes downstream answer accuracy (exact match/F1) when fed to the same generator and fusion method used by the evaluated mRAG systems, and reports the resulting upper-bound performance. No idealized cross-lingual fusion or perfect retrieval is assumed beyond what is achievable with the given evidence pool and standard concatenation. This makes the bound a realistic, tight estimate rather than an inflated theoretical ceiling. revision: yes
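The procedure the rebuttal describes can be sketched as a brute-force subset search, assuming a hypothetical `answer_score` wrapper around the fixed generator and its evaluation metric (exact match or F1); exhaustive enumeration is only feasible for small pools and subset sizes:

```python
from itertools import combinations

def oracle_upper_bound(queries, answer_score, max_subset=3):
    """Estimated oracle bound: for each query, feed every small subset of
    the retrieved multilingual pool to the *same* generator and keep the
    best downstream score. `queries` is a list of (query, document_pool)
    pairs; `answer_score(query, docs) -> float` stands in for the paper's
    generation + evaluation pipeline."""
    best_scores = []
    for query, pool in queries:
        best = 0.0
        for r in range(1, max_subset + 1):
            for docs in combinations(pool, r):
                best = max(best, answer_score(query, docs))
        best_scores.append(best)
    return sum(best_scores) / len(best_scores)
```

Because the search is restricted to the actually retrieved pool and the same fusion path, the resulting bound reflects achievable, not idealized, performance.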
-
Referee: [Experiments] The claim that LAURA 'consistently improves mRAG performance' across diverse languages rests on the assumption that utility-driven alignment does not create compensating drops in low-resource languages; experiments must report per-language breakdowns, statistical tests, and controls to rule out new biases introduced by the generative-utility proxy.
Authors: We appreciate the referee's emphasis on verifying the absence of new biases. The current manuscript already provides per-language breakdowns in the appendix tables, which show gains (or no degradation) across both high- and low-resource languages. In the revision we will move the key per-language results to the main paper, add paired statistical significance tests (t-tests with p-values across multiple random seeds), and include control experiments that compare LAURA against both language-specific rerankers and a language-agnostic baseline. These additions will explicitly demonstrate that the generative-utility alignment does not introduce compensating drops or new language biases. revision: partial
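The proposed paired test is standard; a minimal stdlib sketch over illustrative per-language accuracies (the numbers are invented, and the real evaluation would pair scores per random seed as the rebuttal proposes):

```python
import math
from statistics import mean, stdev

def paired_t(baseline, treated):
    """Paired t-statistic over matched (baseline, treated) accuracy pairs,
    e.g. one pair per language. A quick check that gains are not driven
    by a handful of languages."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-language accuracies, baseline reranker vs. LAURA.
baseline = [0.41, 0.38, 0.29, 0.33, 0.25]
laura    = [0.45, 0.42, 0.33, 0.36, 0.30]
t_stat = paired_t(baseline, laura)
```

A large positive `t_stat` with a matching p-value (from a t-distribution with n-1 degrees of freedom) would support the consistency claim; per-language signs should also be inspected directly for compensating drops.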
Circularity Check
No circularity: empirical alignment method with independent experimental validation
full rationale
The paper presents LAURA as a utility-driven reranker alignment technique for mRAG, supported by experiments across languages and models that demonstrate bias mitigation and performance gains. No equations, derivations, or self-referential constructions appear in the abstract or described method. The estimated oracle analysis quantifies gaps but is not shown to reduce to a fitted parameter or self-definition by construction. Any self-citations (if present in full text) are not load-bearing for the central claim, which rests on external empirical results rather than internal redefinition. This is a standard empirical contribution without the enumerated circular patterns.