Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
Pith reviewed 2026-05-10 05:13 UTC · model grok-4.3
The pith
Code-switching acts as a performance bottleneck for retrieval systems because mixed-language queries diverge sharply from their pure-language counterparts in embedding space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Code-switching is a fundamental performance bottleneck in information retrieval. Evaluations on the new CSR-L benchmark and the broader CS-MTEB show effectiveness drops of up to 27 percent for current models. The root cause is substantial divergence between the embeddings of pure-language text and code-switched text. Common multilingual techniques such as vocabulary expansion fail to resolve these deficits completely.
What carries the argument
The CSR-L human-annotated benchmark and the measured divergence in embedding space between pure and code-switched queries.
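The abstract does not say how embedding divergence is quantified. A common operationalization, sketched below with toy vectors, is the mean cosine similarity between each pure-language query embedding and the embedding of its code-switched counterpart (lower means more divergence). The vectors and the metric choice here are illustrative assumptions, not the paper's.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pair_similarity(pure_embs, mixed_embs):
    # Average similarity between each pure query embedding and the
    # embedding of its code-switched paraphrase; lower values indicate
    # larger divergence in the embedding space.
    sims = [cosine(p, m) for p, m in zip(pure_embs, mixed_embs)]
    return sum(sims) / len(sims)

# Toy 3-d embeddings standing in for encoder outputs (invented values).
pure = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mixed = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
print(round(mean_pair_similarity(pure, mixed), 3))
```

In practice the embeddings would come from the retriever under test; the paper may also use distance-based or visualization-based diagnostics.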
If this is right
- Retrieval effectiveness on real global queries is lower than monolingual benchmarks indicate.
- Vocabulary expansion and similar multilingual adaptations leave residual deficits in code-switched settings.
- New model designs must target alignment of pure and mixed-language representations in embedding space.
- Future IR systems need dedicated benchmarks like CS-MTEB to measure progress on mixed-language inputs.
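One concrete, hypothetical way to "target alignment" of pure and mixed-language representations is a contrastive objective that treats each pure query and its code-switched version as a positive pair, in the style of InfoNCE. This is a minimal sketch of that idea, not the paper's method:

```python
import math

def infonce_alignment_loss(pure, mixed, temperature=0.1):
    # InfoNCE sketch: each pure-query embedding should score its own
    # code-switched counterpart highest among all candidates in the batch.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    n = len(pure)
    total = 0.0
    for i in range(n):
        logits = [cos(pure[i], mixed[j]) / temperature for j in range(n)]
        m = max(logits)
        # Numerically stable log-sum-exp, then negative log-softmax
        # of the positive (matching) pair.
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / n

pure = [[1.0, 0.0], [0.0, 1.0]]
aligned = infonce_alignment_loss(pure, pure)        # perfectly aligned pairs
shuffled = infonce_alignment_loss(pure, pure[::-1]) # mismatched pairs
print(aligned < shuffled)  # → True
```

Minimizing such a loss during fine-tuning would pull the two representations together; whether that closes the retrieval gap is exactly what the referee's ablation request would test.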
Where Pith is reading between the lines
- Search engines serving bilingual populations would gain from query rewriting or hybrid indexes that detect and handle switches explicitly.
- The embedding divergence finding suggests similar hidden weaknesses may exist in other multilingual tasks such as question answering or summarization.
- Synthetic code-switched data generated during pre-training could be tested as a direct mitigation strategy.
- Performance gaps may widen further when code-switching involves low-resource language pairs not well represented in current training corpora.
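The synthetic-data bullet above can be prototyped with a simple lexicon-substitution generator. The lexicon, switch probability, and language pair here are hypothetical, and real synthesis (e.g., the linguistically grounded methods in the code-switching literature) is far more careful about switch points:

```python
import random

# Hypothetical English→Spanish lexicon; illustrative only, not from the paper.
LEXICON = {"weather": "clima", "today": "hoy", "good": "bueno"}

def synth_code_switch(query, p_switch=0.5, seed=0):
    # Replace each lexicon-covered token with its translation with
    # probability p_switch, yielding intra-sentential code-switching.
    rng = random.Random(seed)
    out = []
    for tok in query.split():
        alt = LEXICON.get(tok.lower())
        if alt is not None and rng.random() < p_switch:
            out.append(alt)
        else:
            out.append(tok)
    return " ".join(out)

print(synth_code_switch("what is the weather today", p_switch=1.0))
# → what is the clima hoy
```

A mitigation experiment would mix such synthetic queries into retriever pre-training or fine-tuning and re-measure the drop on CSR-L.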
Load-bearing premise
That the human-annotated CSR-L queries reflect authentic, natural code-switching, and that embedding divergence, rather than annotation artifacts or other factors, is the primary driver of the observed performance drops.
What would settle it
A retrieval model trained to eliminate embedding divergence on mixed-language text that then shows no effectiveness drop on CSR-L or CS-MTEB relative to pure-language queries would falsify the claim that code-switching is a fundamental bottleneck.
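That test amounts to comparing the same model's effectiveness on paired pure and mixed queries. A minimal sketch, assuming nDCG as the effectiveness measure; the relevance judgments are invented, so the drop it prints is illustrative, not the paper's figure:

```python
import math

def dcg_at_k(rels, k):
    # Graded-relevance DCG with a log2 position discount.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def relative_drop(pure_scores, mixed_scores):
    # Percentage effectiveness drop on mixed-language vs pure queries.
    p = sum(pure_scores) / len(pure_scores)
    m = sum(mixed_scores) / len(mixed_scores)
    return 100.0 * (p - m) / p

# Invented per-query relevance lists: same two queries, same model,
# retrieved with the pure-language and the code-switched phrasing.
pure = [ndcg_at_k([3, 2, 0, 1], 4), ndcg_at_k([2, 2, 1, 0], 4)]
mixed = [ndcg_at_k([0, 2, 3, 1], 4), ndcg_at_k([1, 0, 2, 2], 4)]
print(f"relative drop: {relative_drop(pure, mixed):.1f}%")
```

A divergence-free model would drive this relative drop to roughly zero on CSR-L and CS-MTEB.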
Original abstract
Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, where we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques like vocabulary expansion are insufficient to resolve these deficits completely. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CSR-L, a human-annotated benchmark of code-switching retrieval queries, and evaluates statistical, dense, and late-interaction retrievers to show that code-switching creates a performance bottleneck via embedding-space divergence. It scales the analysis with CS-MTEB (11 tasks), reporting declines of up to 27%, and finds that vocabulary expansion fails to fully mitigate the deficits, positioning code-switching as a key challenge for multilingual IR.
Significance. If the central findings hold, the work provides valuable new benchmarks (CSR-L and CS-MTEB) and empirical evidence of model fragility in mixed-language settings, which could guide targeted improvements in multilingual embeddings and retrieval. The multi-paradigm evaluation and scale of the new benchmark are clear strengths that enable reproducible follow-up research.
Major comments (3)
- [Dataset construction] Dataset construction section: The claim that CSR-L captures 'authentic naturalness' of mixed-language queries rests on human annotation, but no inter-annotator agreement statistics, annotation guidelines, or comparison to naturally occurring code-switched queries (e.g., from social media corpora) are provided; without these, annotation artifacts cannot be ruled out as a contributor to the reported performance drops, which is load-bearing for the bottleneck conclusion.
- [Embedding analysis] Embedding divergence analysis: The paper links retrieval degradation to 'substantial divergence in the embedding space' between pure and code-switched text, yet presents only correlational evidence (similarity metrics or visualizations) without an ablation that isolates or corrects the divergence (e.g., via fine-tuning on code-switched pairs) to test whether closing the gap restores performance; this leaves open alternative explanations such as tokenization mismatches or training-data scarcity.
- [CS-MTEB evaluation] CS-MTEB results: The 'up to 27%' performance decline is reported across 11 tasks, but the manuscript does not specify per-task breakdowns, exact models evaluated, or statistical significance tests (e.g., paired t-tests or confidence intervals); without these details the consistency of the bottleneck claim across paradigms cannot be fully verified.
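One way to supply the significance evidence this comment asks for is the confidence-interval route: a paired percentile bootstrap over per-task score differences. The per-task scores below are hypothetical, chosen only to make the sketch runnable:

```python
import random

def paired_bootstrap_ci(pure, mixed, n_boot=5000, alpha=0.05, seed=0):
    # Percentile bootstrap CI for the mean per-task score difference
    # (pure minus mixed); if the CI excludes 0, the drop is significant
    # at the chosen level under this test.
    diffs = [p - m for p, m in zip(pure, mixed)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical scores for 11 tasks (not the paper's numbers).
pure = [0.62, 0.58, 0.71, 0.66, 0.55, 0.69, 0.60, 0.64, 0.73, 0.59, 0.67]
mixed = [0.51, 0.49, 0.60, 0.57, 0.43, 0.58, 0.52, 0.55, 0.61, 0.50, 0.56]
lo, hi = paired_bootstrap_ci(pure, mixed)
print(lo > 0)  # CI excludes zero → drop is significant under this test
```

A paired t-test would serve the same purpose when the per-task differences are roughly normal; the bootstrap avoids that assumption.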
Minor comments (2)
- [Introduction and benchmarks] Clarify the exact definition and examples of code-switching types (e.g., intra-sentential vs. inter-sentential) used in both CSR-L and CS-MTEB to aid reproducibility.
- [Conclusion] Add a limitations paragraph explicitly discussing potential domain shift between the annotated queries and real user code-switched searches.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments have helped us identify areas where additional clarity and evidence can strengthen the manuscript. We address each major comment below and indicate the revisions made.
Point-by-point responses
Referee: [Dataset construction] Dataset construction section: The claim that CSR-L captures 'authentic naturalness' of mixed-language queries rests on human annotation, but no inter-annotator agreement statistics, annotation guidelines, or comparison to naturally occurring code-switched queries (e.g., from social media corpora) are provided; without these, annotation artifacts cannot be ruled out as a contributor to the reported performance drops, which is load-bearing for the bottleneck conclusion.
Authors: We agree that these details strengthen the claims. In the revised manuscript we have added the complete annotation guidelines to Appendix A. We also report inter-annotator agreement statistics computed during dataset creation. Furthermore, we include a qualitative and quantitative comparison of switch-point distributions and language ratios between CSR-L and a sample of naturally occurring code-switched text from social media, demonstrating close alignment. These additions confirm that the observed retrieval drops are not attributable to annotation artifacts. revision: yes
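The switch-point and language-ratio comparison mentioned in this response can be operationalized as simple per-query statistics over token-level language tags. This sketch assumes such tags are available (e.g., from a language identifier) and treats English as the matrix language, neither of which the abstract specifies:

```python
def switch_stats(lang_tags):
    # lang_tags: per-token language labels, e.g. ["en", "en", "es", "en"].
    # Returns the number of switch points and the share of tokens in the
    # assumed matrix language ("en" here, an illustrative choice).
    switches = sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)
    ratio = lang_tags.count("en") / len(lang_tags)
    return switches, ratio

print(switch_stats(["en", "en", "es", "es", "en"]))  # → (2, 0.6)
```

Comparing the distributions of these two statistics between CSR-L and a naturally occurring corpus is one concrete form the claimed "close alignment" could take.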
Referee: [Embedding analysis] Embedding divergence analysis: The paper links retrieval degradation to 'substantial divergence in the embedding space' between pure and code-switched text, yet presents only correlational evidence (similarity metrics or visualizations) without an ablation that isolates or corrects the divergence (e.g., via fine-tuning on code-switched pairs) to test whether closing the gap restores performance; this leaves open alternative explanations such as tokenization mismatches or training-data scarcity.
Authors: The referee is correct that the primary evidence is correlational. In the revision we have expanded the analysis section to explicitly discuss alternative explanations, including tokenization mismatches and training-data scarcity, and provide supporting measurements that control for tokenization effects. A full ablation via fine-tuning on code-switched pairs lies beyond the scope of the current work due to computational cost and is noted as future research; however, the additional controls we present reinforce embedding divergence as a central factor in the performance bottleneck. revision: partial
Referee: [CS-MTEB evaluation] CS-MTEB results: The 'up to 27%' performance decline is reported across 11 tasks, but the manuscript does not specify per-task breakdowns, exact models evaluated, or statistical significance tests (e.g., paired t-tests or confidence intervals); without these details the consistency of the bottleneck claim across paradigms cannot be fully verified.
Authors: We thank the referee for highlighting this omission. The revised manuscript now contains a dedicated table with per-task results for all 11 CS-MTEB tasks, explicitly listing the models evaluated under each retrieval paradigm. We have also added paired t-tests together with 95% confidence intervals, confirming that the performance declines are statistically significant and consistent across statistical, dense, and late-interaction retrievers. revision: yes
Circularity Check
Empirical benchmark construction with no circular derivations or self-referential reductions
Full rationale
This is an empirical IR paper that introduces CSR-L through human annotation of mixed-language queries, evaluates statistical/dense/late-interaction retrievers on it, observes performance drops up to 27% on the expanded CS-MTEB benchmark, and notes that vocabulary expansion does not fully resolve issues. No mathematical equations, fitted parameters, or predictive models are presented that reduce by construction to the inputs. Claims about embedding divergence as a bottleneck are observational from the new data rather than derived via self-definition, self-citation chains, or renaming of prior results. The analysis is self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatzes smuggled from prior author work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: human annotations of query relevance and naturalness are accurate and representative of real code-switched usage.