Pith · machine review for the scientific record

arxiv: 2604.20199 · v1 · submitted 2026-04-22 · 💻 cs.CL

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

Pith reviewed 2026-05-09 23:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual RAG · language bias · reranker alignment · cross-lingual retrieval · retrieval-augmented generation · LAURA · generative utility

The pith

Multilingual RAG rerankers favor English and the query language, creating a gap that LAURA closes by aligning scores to generative utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multilingual retrieval-augmented generation systems exhibit language bias during reranking, systematically preferring English and the query's native language while suppressing useful evidence from other languages. An estimated oracle analysis quantifies the resulting performance gap between current rerankers and the best achievable evidence selection. The authors identify a distributional mismatch in which answer-critical documents are spread across languages yet current systems downrank them. They introduce LAURA to realign reranking scores according to the evidence's utility for the final generation task. Experiments across languages and models show that this reduces bias and raises mRAG performance.

Core claim

Current mRAG systems suffer from language bias during reranking that favors English and the query's native language, as shown by a substantial gap to the oracle upper bound and by systematic suppression of answer-critical documents scattered across multiple languages; LAURA bridges this gap by aligning multilingual evidence ranking directly with downstream generative utility.

What carries the argument

LAURA, a reranker that aligns multilingual evidence scores with the utility of that evidence for the downstream generation model rather than with language identity.
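
The alignment target is easy to state in code. Below is a minimal sketch of utility-driven supervision, with hypothetical generate() and answer_quality() callables standing in for the generation model and its answer scorer; the names and the pairwise loss are illustrative, not the authors' implementation.

    def utility_labels(query, candidates, gold_answer, generate, answer_quality):
        # Label each candidate by marginal generative utility: how much the
        # generator's answer quality changes when this document is added to
        # the context. Language identity plays no role in the label.
        base = answer_quality(generate(query, context=[]), gold_answer)
        return [
            answer_quality(generate(query, context=[doc]), gold_answer) - base
            for doc in candidates
        ]

    def pairwise_alignment_loss(scores, labels, margin=0.1):
        # Train the reranker to order documents by utility: whenever document
        # i is more useful than document j, its score should beat j's by a margin.
        loss = 0.0
        for i in range(len(scores)):
            for j in range(len(scores)):
                if labels[i] > labels[j]:
                    loss += max(0.0, margin - (scores[i] - scores[j]))
        return loss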

If this is right

  • Rerankers select evidence from a broader set of languages when it improves generation quality.
  • The gap between observed mRAG performance and the oracle bound shrinks.
  • Improvements appear consistently across tested languages and generation models without retraining the generator.
  • Answer-critical documents that were previously suppressed become available to the generator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same utility-alignment idea could be applied to other retrieval biases such as domain or cultural preferences.
  • Feedback from generation quality may become a standard signal for training cross-lingual retrievers.
  • Low-resource languages may show larger gains if the bias is stronger there.

Load-bearing premise

That the estimated oracle evidence analysis gives a reliable upper bound, and that realigning reranker scores to generative utility will not create new unintended biases or performance drops in specific languages.

What would settle it

Running the full LAURA pipeline on a new collection of languages and models; finding no reduction in language bias, or no gain in end-to-end mRAG accuracy over the original reranker, would refute the claim.
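
Half of that test is mechanical to compute. Below is a minimal sketch of the bias measurement, in the spirit of the Figure 3 heatmaps; the data layout is hypothetical.

    from collections import Counter

    def language_selection_matrix(selections):
        # selections: {query_language: [language of each selected document]},
        # aggregated over all queries in that language (hypothetical layout).
        # Returns, per query language, the share of selected documents in each
        # document language -- one row of a Figure 3-style heatmap.
        matrix = {}
        for qlang, doc_langs in selections.items():
            counts = Counter(doc_langs)
            total = sum(counts.values())
            matrix[qlang] = {dlang: n / total for dlang, n in counts.items()}
        return matrix

    # A biased reranker concentrates mass on English and on the diagonal
    # (the query's own language); a debiased one spreads it across the row.
    print(language_selection_matrix({"de": ["en", "en", "de", "fr"]}))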

Figures

Figures reproduced from arXiv: 2604.20199 by Ben He, Boxi Cao, Bo Zheng, Cheng Zhang, Dan Wang, Guozhao Mo, Hongyu Lin, Le Sun, Xianpei Han, Xuanang Chen, Yafei Shi, Yaojie Lu.

Figure 1. Illustration of failures induced by reranker language bias.
Figure 2. Illustration of the oracle evidence estimation strategy, where candidate documents are grouped by language.
Figure 3. Heatmaps showing the proportion of selected document languages (y-axis) for each query language (x-axis).
Figure 4. Two-stage data construction pipeline in the LAURA framework.
Figure 5. Language distribution of queries (inner ring).
Figure 6. Vanilla document reranking with BGE-Gemma, BGE-Minicpm and Qwen3-Reranker-0.6B rerankers.
Figure 7. Oracle evidence estimation with BGE-Gemma, BGE-Minicpm and Qwen3-Reranker-0.6B rerankers.
Original abstract

Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such "answer-critical" documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that multilingual RAG systems exhibit language bias in reranking that favors English and the query's native language, quantifies a substantial gap to an estimated oracle upper bound via oracle evidence analysis, identifies a distributional mismatch where systems suppress answer-critical multi-language documents, and proposes LAURA (Language-Agnostic Utility-driven Reranker Alignment) to align reranker scores with downstream generative utility; experiments across languages and models reportedly show bias mitigation and consistent mRAG gains.

Significance. If the oracle bound is tight and LAURA's gains hold without new language-specific degradations, the work would provide a practical method to reduce language bias in mRAG, improving equity and performance for non-English and low-resource languages in retrieval-augmented generation.

major comments (2)
  1. [Abstract / Oracle Evidence Analysis] The estimated oracle evidence analysis (described in the abstract as quantifying the performance gap) is load-bearing for the headline claim of a 'substantial performance gap'; its construction must be specified in detail (e.g., how optimal cross-lingual evidence is selected and whether it assumes idealized fusion) to confirm the bound is not inflated by unrealistic assumptions.
  2. [Experiments] The claim that LAURA 'consistently improves mRAG performance' across diverse languages rests on the assumption that utility-driven alignment does not create compensating drops in low-resource languages; experiments must report per-language breakdowns, statistical tests, and controls to rule out new biases introduced by the generative-utility proxy.
minor comments (2)
  1. [Abstract] The abstract introduces the LAURA acronym and method but provides no concrete details on baselines, reranker models, or generation models used; adding a brief experimental summary would improve readability.
  2. [Introduction / Analysis] Ensure all terms such as 'answer-critical' documents and 'distributional mismatch' are defined on first use with a reference to the relevant analysis section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract / Oracle Evidence Analysis] The estimated oracle evidence analysis (described in the abstract as quantifying the performance gap) is load-bearing for the headline claim of a 'substantial performance gap'; its construction must be specified in detail (e.g., how optimal cross-lingual evidence is selected and whether it assumes idealized fusion) to confirm the bound is not inflated by unrealistic assumptions.

    Authors: We agree that additional detail on the oracle construction is warranted to support the performance gap claim. In the revised manuscript, we will expand the description in Section 3.2 (and add a clarifying paragraph in the abstract if space permits) to specify the exact procedure: the oracle enumerates feasible combinations of documents from the multilingual retrieval pool, selects the subset that maximizes downstream answer accuracy (exact match/F1) when fed to the same generator and fusion method used by the evaluated mRAG systems, and reports the resulting upper-bound performance. No idealized cross-lingual fusion or perfect retrieval is assumed beyond what is achievable with the given evidence pool and standard concatenation. This makes the bound a realistic, tight estimate rather than an inflated theoretical ceiling; a minimal sketch of this search appears after these responses. revision: yes

  2. Referee: [Experiments] The claim that LAURA 'consistently improves mRAG performance' across diverse languages rests on the assumption that utility-driven alignment does not create compensating drops in low-resource languages; experiments must report per-language breakdowns, statistical tests, and controls to rule out new biases introduced by the generative-utility proxy.

    Authors: We appreciate the referee's emphasis on verifying the absence of new biases. The current manuscript already provides per-language breakdowns in the appendix tables, which show gains (or no degradation) across both high- and low-resource languages. In the revision we will move the key per-language results to the main paper, add paired statistical significance tests (t-tests with p-values across multiple random seeds; see the sketch after these responses), and include control experiments that compare LAURA against both language-specific rerankers and a language-agnostic baseline. These additions will explicitly demonstrate that the generative-utility alignment does not introduce compensating drops or new language biases. revision: partial
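
The oracle procedure described in response 1 is concrete enough to sketch. What follows is an exhaustive-search rendering of that definition, with hypothetical generate() and answer_quality() callables for the fixed generator and its EM/F1 scorer; the paper's estimation strategy prunes this search (Figure 2 groups candidates by language), so read this as the definition of the bound, not the implementation.

    from itertools import combinations

    def estimated_oracle(query, pool, gold_answer, generate, answer_quality, k=5):
        # Upper bound as described above: over all size-k evidence subsets of
        # the multilingual retrieval pool, keep the one that maximizes answer
        # quality under the same generator and plain context concatenation
        # used by the evaluated mRAG systems. No idealized fusion is assumed.
        best_score, best_subset = float("-inf"), None
        for subset in combinations(pool, k):
            answer = generate(query, context=list(subset))
            score = answer_quality(answer, gold_answer)
            if score > best_score:
                best_score, best_subset = score, subset
        return best_score, best_subset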

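The statistics promised in response 2 are likewise standard. Below is a sketch of the per-language paired test across seeds using scipy.stats.ttest_rel; the array layout is hypothetical.

    from scipy import stats

    def per_language_significance(baseline_runs, laura_runs):
        # baseline_runs / laura_runs: {language: [accuracy per random seed]},
        # with seeds paired between the two systems (hypothetical layout).
        # Per-language mean gain plus a paired t-test p-value is what rules
        # out compensating drops hidden inside an averaged improvement.
        report = {}
        for lang in baseline_runs:
            base, ours = baseline_runs[lang], laura_runs[lang]
            t, p = stats.ttest_rel(ours, base)
            gain = sum(ours) / len(ours) - sum(base) / len(base)
            report[lang] = {"mean_gain": gain, "t": t, "p_value": p}
        return report
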
Circularity Check

0 steps flagged

No circularity: empirical alignment method with independent experimental validation

Full rationale

The paper presents LAURA as a utility-driven reranker alignment technique for mRAG, supported by experiments across languages and models that demonstrate bias mitigation and performance gains. No equations, derivations, or self-referential constructions appear in the abstract or described method. The estimated oracle analysis quantifies gaps but is not shown to reduce to a fitted parameter or self-definition by construction. Any self-citations (if present in full text) are not load-bearing for the central claim, which rests on external empirical results rather than internal redefinition. This is a standard empirical contribution without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the oracle analysis and LAURA objective are described at a high level without mathematical or implementation details.

pith-pipeline@v0.9.0 · 5500 in / 1084 out tokens · 34264 ms · 2026-05-09T23:48:10.652742+00:00 · methodology
