pith. the verified trust layer for science.

arxiv: 2510.14274 · v2 · submitted 2025-10-16 · 💻 cs.CL

Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

Pith reviewed 2026-05-18 06:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual embeddings · retrieval · hard negatives · task diversity · small models · model retrofitting · embedding performance

The pith

A 300-million-parameter multilingual model can match or exceed 7-billion-parameter models on retrieval tasks through targeted training adjustments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why small multilingual embedding models lag behind larger ones specifically on retrieval, despite solid results on other multilingual tasks. It tests how training data volume, hard negative examples, and the mix of tasks versus languages each affect final accuracy. Scale brings quick gains that then flatten, while hard negatives deliver steady improvements and task variety proves more valuable than simply covering more languages. These insights allow the authors to build a compact model that reaches or beats current strong 7B baselines. If the approach holds, high-quality multilingual search becomes feasible on far less hardware and compute.

Core claim

We develop a compact (approximately 300M-parameter) multilingual model that achieves retrieval performance comparable to, or even surpassing, current strong 7B models. The model gets there by focusing on hard negative sampling and task-diverse training mixtures, after the authors observe that gains from raw data scale quickly plateau.

What carries the argument

Retrofitting via hard-negative sampling combined with task-diverse data mixtures rather than language expansion or continued data scaling.
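To make the idea of hard-negative sampling concrete, here is a minimal sketch. It assumes a common definition: hard negatives are the non-relevant passages that score highest against the query under some embedding model. The source does not state the paper's exact mining procedure, so the function `mine_hard_negatives`, the toy vectors, and the choice of cosine similarity are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negatives(query_vec, passage_vecs, positive_ids, k=2):
    """Return ids of the k non-positive passages most similar to the query.

    Hypothetical helper: the mining rule is an assumption, not the paper's.
    """
    scored = [
        (cosine(query_vec, vec), pid)
        for pid, vec in passage_vecs.items()
        if pid not in positive_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]

# Toy 2-d embeddings; real embeddings would come from an encoder.
passages = {
    "p_pos": [1.0, 0.0],
    "p_near": [0.9, 0.1],
    "p_mid": [0.5, 0.5],
    "p_far": [0.0, 1.0],
}
hard = mine_hard_negatives([1.0, 0.0], passages, positive_ids={"p_pos"}, k=2)
print(hard)
```

The passages closest to the query that are not labeled relevant are returned first. In practice such mining can also pick up unlabeled true positives, which is one reason the sampling ratio is treated as a tunable choice.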

Load-bearing premise

The finding that task diversity contributes more to performance than language diversity depends on the authors' internal breakdown of their specific training mixtures.

What would settle it

Train two otherwise identical 300M models, one using a high-task-diversity mixture and one using a high-language-diversity but low-task-diversity mixture with the same total examples and hard negatives, then measure which reaches higher recall on a standard multilingual retrieval benchmark.
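The comparison proposed above comes down to computing the same recall metric for both models on the same queries. The sketch below shows one way to do that, assuming each model's output is a ranked list of document ids per query. The names `recall_at_k` and `mean_recall_at_k`, along with the toy rankings and relevance judgments, are hypothetical and stand in for a real benchmark.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(rankings, qrels, k=10):
    """Average Recall@k over all queries that have relevance judgments."""
    scores = [recall_at_k(rankings[q], qrels[q], k) for q in qrels]
    return sum(scores) / len(scores)

# Toy relevance judgments and rankings (illustrative only, not paper data).
qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}
task_diverse_model = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5", "d6"]}
language_diverse_model = {"q1": ["d3", "d1", "d4"], "q2": ["d6", "d4", "d5"]}

score_task = mean_recall_at_k(task_diverse_model, qrels, k=2)
score_lang = mean_recall_at_k(language_diverse_model, qrels, k=2)
print(score_task, score_lang)
```

Because both models are scored on identical queries and judgments, any difference in the averages reflects the rankings alone, which is the property the proposed experiment relies on.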

Figures

Figures reproduced from arXiv: 2510.14274 by Lifu Tu, Semih Yavuz, Yingbo Zhou.

Figure 1
Figure 1: Synthetic data scale experiments. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Adjacent text (§5.3 Hard Negatives): Recent studies (Lee et al., 2024, 2025) have demonstrated the effectiveness of incorporating hard nega...
Figure 2
Figure 2: Hard negatives experiments. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]
Table extracted with the figure (§5.4 Task Diversity and Language Diversity):

                   Mul.    Eur.    Fr      Ja
Our (all)          60.56   59.76   68.28   72.29
En-Mix             59.60   58.22   67.56   72.26
Mul-Syn            59.36   58.61   67.19   71.46
En-Syn + En-Mix    60.38   59.56   68.05   72.29
Original abstract

Training effective multilingual embedding models presents unique challenges due to the diversity of languages and task objectives. Although small multilingual models (<1B parameters) perform well on multilingual tasks generally, they consistently lag behind larger models (>1B) in the most prevalent use case: retrieval. This raises a critical question: Can smaller models be retrofitted specifically for retrieval tasks to enhance their performance? In this work, we investigate key factors that influence the effectiveness of multilingual embeddings, focusing on training data scale, negative sampling strategies, and data diversity. We find that while increasing the scale of training data yields initial performance gains, these improvements quickly plateau, indicating diminishing returns. Incorporating hard negatives proves essential for consistently improving retrieval accuracy. Furthermore, our analysis reveals that task diversity in the training data contributes more significantly to performance than language diversity alone. As a result, we develop a compact (approximately 300M) multilingual model that achieves retrieval performance comparable to or even surpassing current strong 7B models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript investigates retrofitting small multilingual models (<1B parameters) for retrieval tasks. It empirically examines the effects of training data scale (finding diminishing returns after initial gains), hard negative sampling (essential for consistent gains), and data diversity (claiming task diversity contributes more than language diversity). Based on these, the authors develop a ~300M parameter model that achieves retrieval performance comparable to or surpassing strong 7B models on standard benchmarks.

Significance. If the central performance claims hold after addressing controls in the diversity analysis, this would be a meaningful contribution to multilingual embedding research by showing that targeted retrofitting and data curation can close the gap with much larger models, with implications for efficient deployment. The observations on plateauing scale effects and hard negatives are useful empirical guidance, though the task-vs-language diversity ranking requires tighter validation to support the training recipe.

major comments (2)
  1. [§4] §4 (Data Diversity Analysis) and associated ablation tables: The prioritization of task diversity over language diversity in constructing the final training mixture for the 300M model is justified by internal comparisons, but the manuscript does not specify whether total example count or data volume was held fixed across the task-diversity and language-diversity conditions. Without this control, performance deltas could arise from unequal data budgets rather than diversity type, directly affecting the validity of the retrofitting recipe and the headline claim.
  2. [Results section] Results section and Table comparing to 7B models: The claim of matching or surpassing 7B models lacks explicit details on whether the 7B baselines were evaluated zero-shot, fine-tuned on the same mixtures, or used their original training; this information is load-bearing for interpreting the 'comparable or surpassing' result and for reproducibility.
minor comments (3)
  1. [Abstract] Abstract and §3: Provide quantitative numbers (e.g., exact nDCG or Recall@10 deltas, standard deviations, and number of runs) for the reported trends on data scale plateau and hard-negative benefits, rather than qualitative statements only.
  2. [§2] Notation and §2: Clarify the exact definition of 'hard negatives' and the sampling ratio used, as this ratio is one of the method's key free parameters.
  3. Figure captions: Ensure all figures include error bars or statistical significance markers where performance comparisons are shown.
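The first minor comment asks for nDCG or Recall@10 values with standard deviations over several runs. As a rough illustration of what such reporting involves, the sketch below computes nDCG@k from graded relevance lists and summarizes it across runs. It is not the paper's evaluation code: the per-seed relevance lists are made-up toy inputs, and for brevity the ideal ordering is taken from the same ranked list, which assumes every relevant document was retrieved.

```python
import math
import statistics

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10):
    """nDCG@k; the ideal ordering is derived from the same list (assumption)."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance lists for one query under three training seeds.
runs_by_seed = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
scores = [ndcg_at_k(gains, k=3) for gains in runs_by_seed]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"nDCG@3 = {mean_score:.3f} +/- {std_score:.3f} over {len(scores)} runs")
```

Reporting the mean together with the spread and the number of runs lets a reader judge whether a plateau or a hard-negative gain exceeds seed-to-seed noise.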

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our work. We address the major comments point-by-point below, providing clarifications and indicating the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: §4 (Data Diversity Analysis) and associated ablation tables: The prioritization of task diversity over language diversity in constructing the final training mixture for the 300M model is justified by internal comparisons, but the manuscript does not specify whether total example count or data volume was held fixed across the task-diversity and language-diversity conditions. Without this control, performance deltas could arise from unequal data budgets rather than diversity type, directly affecting the validity of the retrofitting recipe and the headline claim.

    Authors: We agree that controlling for data volume is crucial to validate the claims about diversity. In our experiments, the total number of training examples was kept constant across the task diversity, language diversity, and baseline conditions by adjusting the sampling rates from the respective datasets. However, this control was not clearly documented in the original manuscript. We will revise §4 to explicitly describe this experimental setup and update the ablation tables with a note on the fixed data budget. revision: yes

  2. Referee: Results section and Table comparing to 7B models: The claim of matching or surpassing 7B models lacks explicit details on whether the 7B baselines were evaluated zero-shot, fine-tuned on the same mixtures, or used their original training; this information is load-bearing for interpreting the 'comparable or surpassing' result and for reproducibility.

    Authors: The 7B baselines were evaluated using their original, publicly available model checkpoints in a zero-shot manner on the standard retrieval benchmarks. No fine-tuning on our specific training mixtures was performed for these comparisons. We will add this important detail to the results section and the table description in the revised manuscript to improve clarity and reproducibility. revision: yes
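The first response says the total number of training examples was held constant across conditions by adjusting per-dataset sampling rates. One simple way to realize such a fixed budget is sketched below. It assumes each condition is described by relative source weights, and it uses largest-remainder rounding so that the counts sum exactly to the budget. The function `allocate_budget`, the source names, and the weights are hypothetical; the source does not describe the authors' actual procedure.

```python
def allocate_budget(weights, total_examples):
    """Split a fixed example budget across sources in proportion to weights.

    Uses largest-remainder rounding so the counts sum exactly to the budget.
    """
    total_weight = sum(weights.values())
    exact = {name: total_examples * w / total_weight for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total_examples - sum(counts.values())
    # Hand the remaining examples to the sources with the largest fractions.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

BUDGET = 1000  # identical for every condition, so only the mix differs

# Hypothetical mixtures: one varies tasks, the other varies languages.
task_diverse = allocate_budget(
    {"retrieval": 1, "sts": 1, "classification": 1}, BUDGET
)
language_diverse = allocate_budget({"en": 2, "fr": 1, "ja": 1}, BUDGET)
print(task_diverse, language_diverse)
```

With the budget pinned, a difference in downstream scores between the two mixtures cannot be attributed to one condition simply seeing more data, which is the control the referee asked for.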

Circularity Check

0 steps flagged

Empirical retrieval benchmarks ground the 300M model claim; internal diversity ablation does not reduce metrics to fitted inputs by construction

full rationale

The paper trains a compact multilingual model and reports performance on standard retrieval benchmarks, with ablations examining data scale, negatives, and diversity. The claim that task diversity outweighs language diversity derives from the authors' mixture experiments rather than any self-referential definition or parameter fit that forces the headline result. No equation or procedure equates the final retrieval score to a quantity defined solely within the paper's own fitted values; results remain externally testable on held-out data. A minor self-citation risk exists around the ablation controls, but it is not load-bearing for the core performance comparison.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work is almost entirely empirical; it relies on standard assumptions about embedding training rather than new theoretical constructs.

free parameters (2)
  • hard negative sampling ratio
    Chosen empirically to improve retrieval accuracy
  • training data mixture proportions
    Adjusted to emphasize task diversity
axioms (1)
  • domain assumption: Hard negatives improve contrastive learning for retrieval
    Invoked when stating that hard negatives are essential
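A small numeric sketch can show why this assumption is plausible. It uses the InfoNCE objective, a widely used contrastive loss, as a stand-in; the source does not say which loss the paper trains with. The similarity values and the temperature of 0.05 are illustrative assumptions.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's logit.

    The temperature and similarity inputs are illustrative assumptions.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Same positive, two sets of negatives: dissimilar ones and near-misses.
loss_easy = info_nce(0.8, [0.1, 0.0])
loss_hard = info_nce(0.8, [0.75, 0.7])
print(loss_easy, loss_hard)
```

With easy negatives the loss is already close to zero, so the model receives almost no gradient. Near-miss negatives keep the loss well above zero, which is the usual argument for why hard negatives continue to provide a training signal after easy ones stop helping.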

pith-pipeline@v0.9.0 · 5703 in / 1250 out tokens · 40160 ms · 2026-05-18T06:55:13.262722+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 7 internal anchors

  1. [1]

    M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

    M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, Bangkok, Thailand. Association for Computational Linguistics. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume ...

  2. [2]

    Unsupervised Cross-lingual Representation Learning at Scale

    Unsupervised cross-lingual representation learning at scale. Preprint, arXiv:1911.02116. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

  3. [3]

    BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics. Kennet...

  4. [4]

    SimCSE: Simple contrastive learning of sentence embeddings

    SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wan...

  5. [5]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Retrieval-augmented generation for large language models: A survey. Preprint, arXiv:2312.10997. Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave

  6. [6]

    Mistral 7B. Preprint, arXiv:2310.06825. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

  7. [7]

    Gemini Embedding: Generalizable Embeddings from Gemini

    Gemini Embedding: Generalizable embeddings from Gemini. Preprint, arXiv:2503.07891. Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonna...

  8. [8]

    Gecko: Versatile text embeddings distilled from large language models

    Gecko: Versatile text embeddings distilled from large language models. Preprint, arXiv:2403.20327. Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang

  9. [9]

    Towards General Text Embeddings with Multi-stage Contrastive Learning

    Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281. Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz

  10. [10]

    Leveraging LLMs for synthesizing training data across many languages in multilingual dense retrieval. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7699–7724, Mexico City, Mexico. Association for Computational Linguistics....

  11. [11]

    FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics. Jörg Tiedemann

  12. [12]

    Improving text embeddings with large language models

    Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei

  13. [13]

    Multilingual E5 Text Embeddings: A Technical Report

    Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672. Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff

  14. [14]

    C-Pack: Packed Resources For General Chinese Embeddings

    C-Pack: Packaged resources to advance general Chinese embedding. Preprint, arXiv:2309.07597. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel

  15. [15]

    mT5: A massively multilingual pre-trained text-to-text transformer

    mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Rusla...

  16. [16]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics. Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos

  17. [17]

    Arctic-Embed 2.0: Multilingual retrieval without compromise

    Arctic-Embed 2.0: Multilingual retrieval without compromise. Preprint, arXiv:2412.04506. Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang

  18. [18]

    mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval

    mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412, Miami, Florida, US. Association for Computational Linguistics. A Details on Synthetic Data Generation A.1 Languages for Synthe...