arxiv: 2510.14274 · v2 · submitted 2025-10-16 · 💻 cs.CL

Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

Lifu Tu, Yingbo Zhou, Semih Yavuz

Pith reviewed 2026-05-18 06:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords multilingual embeddings · retrieval · hard negatives · task diversity · small models · model retrofitting · embedding performance


The pith

A 300 million parameter multilingual model can match or exceed 7 billion parameter models on retrieval tasks through targeted training adjustments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why small multilingual embedding models lag behind larger ones specifically on retrieval, despite solid results on other multilingual tasks. It tests how training data volume, hard negative examples, and the mix of tasks versus languages each affect final accuracy. Scale brings quick gains that then flatten, while hard negatives deliver steady improvements and task variety proves more valuable than simply covering more languages. These insights allow the authors to build a compact model that reaches or beats current strong 7B baselines. If the approach holds, high-quality multilingual search becomes feasible on far less hardware and compute.

Core claim

We develop a compact approximately 300M multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models by focusing on hard negative sampling and task-diverse training mixtures after observing that raw data scale quickly plateaus.

What carries the argument

Retrofitting via hard-negative sampling combined with task-diverse data mixtures rather than language expansion or continued data scaling.
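To make the role of hard negatives concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss for a single query. This page does not state the paper's exact loss, so the loss form, the temperature of 0.05, and the toy vectors are all assumptions chosen for illustration.

```python
import math

def dot(u, v):
    """Plain dot product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Contrastive loss for one query: negative log-softmax of the positive.

    The temperature value is an assumption, not taken from the paper.
    """
    # Logits: the positive first, then every negative.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

# Toy vectors, invented for illustration only.
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]  # far from the query
hard_negative = [0.8, 0.2]   # close to the query but not relevant

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [hard_negative])
print(loss_easy, loss_hard)
```

In this toy case the easy negative contributes almost nothing to the loss, while the hard negative produces a clearly non-zero loss and therefore a usable gradient. That is the usual intuition for why hard negatives keep helping after easy ones stop.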

Load-bearing premise

The finding that task diversity contributes more to performance than language diversity depends on the authors' internal breakdown of their specific training mixtures.

What would settle it

Train two otherwise identical 300M models, one using a high-task-diversity mixture and one using a high-language-diversity but low-task-diversity mixture with the same total examples and hard negatives, then measure which reaches higher recall on a standard multilingual retrieval benchmark.
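The comparison described above comes down to scoring two runs with the same recall metric. The following is a minimal sketch of recall@k; the query IDs, document IDs, rankings, and the choice of k are invented for illustration and do not come from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(rankings, qrels, k=10):
    """Average recall@k over every query that has relevance judgments."""
    scores = [recall_at_k(rankings[q], qrels[q], k) for q in qrels]
    return sum(scores) / len(scores)

# Hypothetical relevance judgments and rankings from the two 300M models.
qrels = {"q1": ["d1", "d2"], "q2": ["d5"]}
task_diverse_run = {"q1": ["d1", "d9", "d2"], "q2": ["d5", "d7", "d8"]}
language_diverse_run = {"q1": ["d9", "d1", "d8"], "q2": ["d7", "d8", "d5"]}

task_score = mean_recall_at_k(task_diverse_run, qrels, k=2)
lang_score = mean_recall_at_k(language_diverse_run, qrels, k=2)
print(task_score, lang_score)
```

For the comparison to be informative, both runs would need the same benchmark, the same k, and the same relevance judgments, so that the training mixture is the only thing that differs.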

Figures

Figures reproduced from arXiv: 2510.14274 by Lifu Tu, Semih Yavuz, Yingbo Zhou.

**Figure 1.** Synthetic data scale experiments.

**Figure 2.** Hard negatives experiments.

Results table from §5.4 (Task Diversity and Language Diversity), extracted alongside this figure:

| | Mul. | Eur. | Fr | Ja |
|---|---|---|---|---|
| Our (all) | 60.56 | 59.76 | 68.28 | 72.29 |
| En-Mix | 59.60 | 58.22 | 67.56 | 72.26 |
| Mul-Syn | 59.36 | 58.61 | 67.19 | 71.46 |
| En-Syn + En-Mix | 60.38 | 59.56 | 68.05 | 72.29 |

Original abstract

Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1 B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1 B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau - indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A 300M model matching 7B retrieval performance is the headline result, but the task-over-language diversity ranking rests on an ablation whose data-volume controls are unclear.


The main takeaway is that a compact 300M multilingual embedding model can reach or exceed the retrieval performance of current 7B models when trained with hard negatives and a data mix that favors task variety over language variety. The authors show that simply adding more training data stops helping after a point, while hard negatives deliver consistent lifts, and that their particular mixture choices let the small model close the gap to much larger ones. This is a practical finding for anyone who needs multilingual retrieval without large inference costs. The experiments on scaling plateaus and negative sampling are straightforward and line up with what people already do in monolingual settings, just applied here to the multilingual case.

The claim that task diversity matters more than language diversity is the part that feels less settled. It comes from their internal mixture comparisons, and the description does not confirm that total example counts were held fixed or that language-specific and task-specific effects were isolated. If the conditions differed in data budget, the ranking could be an artifact rather than a real signal about diversity. The abstract also omits the actual scores, baseline details, and any statistical checks, which makes it hard to gauge how large or reliable the gains are.

This paper is aimed at practitioners who build or deploy multilingual retrieval systems and want to stay under a billion parameters. Readers who care about data curation and negative sampling for embeddings will find the trends useful. It is solid enough on the empirical side to deserve a serious referee, though the authors should be asked to document the ablation controls and report the full numbers with baselines.

Referee Report

2 major / 3 minor

Summary. The manuscript investigates retrofitting small multilingual models (<1B parameters) for retrieval tasks. It empirically examines the effects of training data scale (finding diminishing returns after initial gains), hard negative sampling (essential for consistent gains), and data diversity (claiming task diversity contributes more than language diversity). Based on these, the authors develop a ~300M parameter model that achieves retrieval performance comparable to or surpassing strong 7B models on standard benchmarks.

Significance. If the central performance claims hold after addressing controls in the diversity analysis, this would be a meaningful contribution to multilingual embedding research by showing that targeted retrofitting and data curation can close the gap with much larger models, with implications for efficient deployment. The observations on plateauing scale effects and hard negatives are useful empirical guidance, though the task-vs-language diversity ranking requires tighter validation to support the training recipe.

major comments (2)

[§4] §4 (Data Diversity Analysis) and associated ablation tables: The prioritization of task diversity over language diversity in constructing the final training mixture for the 300M model is justified by internal comparisons, but the manuscript does not specify whether total example count or data volume was held fixed across the task-diversity and language-diversity conditions. Without this control, performance deltas could arise from unequal data budgets rather than diversity type, directly affecting the validity of the retrofitting recipe and the headline claim.
[Results section] Results section and Table comparing to 7B models: The claim of matching or surpassing 7B models lacks explicit details on whether the 7B baselines were evaluated zero-shot, fine-tuned on the same mixtures, or used their original training; this information is load-bearing for interpreting the 'comparable or surpassing' result and for reproducibility.

minor comments (3)

[Abstract] Abstract and §3: Provide quantitative numbers (e.g., exact nDCG or Recall@10 deltas, standard deviations, and number of runs) for the reported trends on data scale plateau and hard-negative benefits, rather than qualitative statements only.
[§2] Notation and §2: Clarify the exact definition of 'hard negatives' and the sampling ratio used, as this is a key hyperparameter in the free_parameters list.
Figure captions: Ensure all figures include error bars or statistical significance markers where performance comparisons are shown.
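The minor comments ask for quantitative metrics and run-to-run variation. A minimal sketch of how nDCG@10 and a mean with standard deviation over repeated runs could be computed is shown here. The gain values and per-seed scores are invented for illustration, and the helper names are hypothetical rather than taken from the paper.

```python
import math
import statistics

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_gains: relevance grades in the order the system returned them.
    judged_gains: relevance grades of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def summarize_runs(scores):
    """Mean and sample standard deviation across repeated training runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# One query whose single relevant document was ranked second.
score = ndcg_at_k([0, 1, 0], judged_gains=[1])

# Hypothetical nDCG@10 values from three random seeds.
mean_score, std_score = summarize_runs([0.50, 0.52, 0.54])
print(score, mean_score, std_score)
```

Reporting the standard deviation next to each mean is also what would let figures carry error bars, as the last comment requests.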

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We address the major comments point-by-point below, providing clarifications and indicating the revisions we will make to the manuscript.


Referee: §4 (Data Diversity Analysis) and associated ablation tables: The prioritization of task diversity over language diversity in constructing the final training mixture for the 300M model is justified by internal comparisons, but the manuscript does not specify whether total example count or data volume was held fixed across the task-diversity and language-diversity conditions. Without this control, performance deltas could arise from unequal data budgets rather than diversity type, directly affecting the validity of the retrofitting recipe and the headline claim.

Authors: We agree that controlling for data volume is crucial to validate the claims about diversity. In our experiments, the total number of training examples was kept constant across the task diversity, language diversity, and baseline conditions by adjusting the sampling rates from the respective datasets. However, this control was not clearly documented in the original manuscript. We will revise §4 to explicitly describe this experimental setup and update the ablation tables with a note on the fixed data budget. revision: yes
Referee: Results section and Table comparing to 7B models: The claim of matching or surpassing 7B models lacks explicit details on whether the 7B baselines were evaluated zero-shot, fine-tuned on the same mixtures, or used their original training; this information is load-bearing for interpreting the 'comparable or surpassing' result and for reproducibility.

Authors: The 7B baselines were evaluated using their original, publicly available model checkpoints in a zero-shot manner on the standard retrieval benchmarks. No fine-tuning on our specific training mixtures was performed for these comparisons. We will add this important detail to the results section and the table description in the revised manuscript to improve clarity and reproducibility. revision: yes
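The first response says the data budget was held constant by adjusting sampling rates. One way such a control could be implemented is sketched below. The source names, weights, and budget are hypothetical, and the simulated rebuttal does not describe the authors' actual procedure.

```python
import random

def sample_fixed_budget(sources, weights, budget, seed=0):
    """Draw exactly `budget` examples from named sources.

    Assumes integer weights so that quotas are computed exactly.
    """
    rng = random.Random(seed)
    names = sorted(sources)
    total = sum(weights[n] for n in names)
    # Per-source quota proportional to its weight (floored).
    quotas = {n: budget * weights[n] // total for n in names}
    # Hand any rounding remainder to the highest-weight sources first.
    order = sorted(names, key=lambda n: -weights[n])
    leftover = budget - sum(quotas.values())
    for i in range(leftover):
        quotas[order[i % len(order)]] += 1
    mixture = []
    for n in names:
        # Sample with replacement so that small sources can be upsampled.
        mixture.extend((n, rng.choice(sources[n])) for _ in range(quotas[n]))
    rng.shuffle(mixture)
    return mixture

# Hypothetical pools for the two ablation conditions.
task_sources = {"retrieval": ["r1", "r2", "r3"], "classification": ["c1", "c2"], "sts": ["s1"]}
lang_sources = {"en": ["e1", "e2"], "fr": ["f1"], "ja": ["j1", "j2", "j3"]}

task_mix = sample_fixed_budget(
    task_sources, {"retrieval": 2, "classification": 1, "sts": 1}, budget=1000
)
lang_mix = sample_fixed_budget(lang_sources, {"en": 1, "fr": 1, "ja": 1}, budget=1000)
print(len(task_mix), len(lang_mix))
```

Both conditions end up with the same number of training examples, so a difference in retrieval score cannot be attributed to one condition simply seeing more data.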

Circularity Check

0 steps flagged

Empirical retrieval benchmarks ground the 300M model claim; internal diversity ablation does not reduce metrics to fitted inputs by construction


The paper trains a compact multilingual model and reports performance on standard retrieval benchmarks, with ablations examining data scale, negatives, and diversity. The claim that task diversity outweighs language diversity derives from the authors' mixture experiments rather than any self-referential definition or parameter fit that forces the headline result. No equation or procedure equates the final retrieval score to a quantity defined solely within the paper's own fitted values; results remain externally testable on held-out data. A minor self-citation risk exists around the ablation controls, but it is not load-bearing for the core performance comparison.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The work is almost entirely empirical; it relies on standard assumptions about embedding training rather than new theoretical constructs.

free parameters (2)

hard negative sampling ratio
Chosen empirically to improve retrieval accuracy
training data mixture proportions
Adjusted to emphasize task diversity
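As an illustration of the first free parameter, the sketch below mines a fixed number of hard negatives per query by taking the highest-scoring passages that are not labeled positive. The scoring function, the passage IDs, and the `num_negatives` value are assumptions. This page does not say how the paper mines negatives or what ratio it uses.

```python
def dot(u, v):
    """Plain dot product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(query_vec, passages, positive_ids, num_negatives=2):
    """Return IDs of the highest-scoring passages that are not positives.

    passages: dict mapping passage ID to its embedding vector.
    num_negatives: hard negatives kept per query (a hypothetical setting).
    """
    scored = [
        (dot(query_vec, vec), pid)
        for pid, vec in passages.items()
        if pid not in positive_ids
    ]
    # Highest similarity first: these are the "hardest" negatives.
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:num_negatives]]

# Toy embeddings, invented for illustration only.
passages = {
    "p_pos": [0.9, 0.1],
    "p_a": [0.8, 0.2],
    "p_b": [0.1, 0.9],
    "p_c": [-1.0, 0.0],
    "p_d": [0.5, 0.5],
}
hard_negatives = mine_hard_negatives([1.0, 0.0], passages, positive_ids={"p_pos"})
print(hard_negatives)
```

A known risk with this kind of mining is that some top-ranked passages are relevant but unlabeled, so pipelines often add a filtering step. Whether the paper does so is not stated here.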

axioms (1)

domain assumption: Hard negatives improve contrastive learning for retrieval
Invoked when stating that hard negatives are essential

pith-pipeline@v0.9.0 · 5703 in / 1250 out tokens · 40160 ms · 2026-05-18T06:55:13.262722+00:00 · methodology

