m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3
The pith
A single pretrained embedding model supports multiple sizes and resource levels by jointly optimizing across layers and dimensions during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
m3BERT introduces a pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions so a single model remains consistent with pretraining when later used at any chosen size or depth. After monolingual pretraining, multilingual adaptation, and continual pretraining on a massive web-domain corpus, the model outperforms prior state-of-the-art embedding models on the large-scale Bing-Click industrial retrieval dataset and demonstrates general effectiveness on public datasets.
What carries the argument
The Matryoshka pretraining strategy, which jointly optimizes representations at multiple transformer layers and multiple embedding dimensions to support flexible post-training adaptation without misalignment.
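One way to picture the joint objective is as a token-prediction loss summed over every (layer, dimension) pair. The sketch below is illustrative only: the function names, the toy inputs, and the choice to truncate both the hidden state and the output weights to their first d coordinates are assumptions, not details confirmed by the paper.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_matryoshka_loss(hidden_by_layer, out_weights, target, layers, dims):
    """Sum a token-prediction loss over every (layer, dimension) pair.

    hidden_by_layer: dict mapping a layer index to its hidden vector.
    out_weights: one weight row per vocabulary item, same length as the hidden vector.
    """
    total = 0.0
    for layer in layers:
        h = hidden_by_layer[layer]
        for d in dims:
            # Assumption: only the first d coordinates are used at granularity d.
            logits = [sum(w[k] * h[k] for k in range(d)) for w in out_weights]
            total += cross_entropy(logits, target)
    return total


# Toy example: two exit layers, two nested widths, a three-word vocabulary.
hidden = {2: [0.5, -0.2, 0.1, 0.3], 4: [1.0, 0.4, -0.3, 0.2]}
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
loss = joint_matryoshka_loss(hidden, weights, target=0, layers=[2, 4], dims=[2, 4])
```

Because every pair contributes to the same scalar, a single backward pass would update the shared parameters for all granularities at once, which is the property the claim relies on.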
If this is right
- A single m3BERT checkpoint can be deployed at high-accuracy, medium, or low-resource settings without separate retraining runs.
- Retrieval performance on industrial-scale data improves because downstream usage stays aligned with the original pretraining objective.
- Multilingual and domain-adapted capabilities remain available even after the model is reduced to fit tighter constraints.
- The same multigranular training pattern proves useful on public benchmarks beyond the proprietary dataset.
Where Pith is reading between the lines
- Production systems could switch embedding sizes on the fly according to current load or hardware limits using the same underlying model.
- The staged pretraining sequence suggests that adding domain-specific web data after multilingual training is especially valuable for commercial retrieval quality.
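The idea of switching embedding sizes on the fly can be sketched as follows. This is a minimal illustration that assumes, as in earlier Matryoshka-style work, that a smaller embedding is simply the leading coordinates of the full one; `choose_dim` and its thresholds are hypothetical and do not come from the paper.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)


def choose_dim(load, dims=(64, 256, 768)):
    """Hypothetical policy: pick a smaller width as load (0.0 to 1.0) rises."""
    if load > 0.8:
        return dims[0]
    if load > 0.5:
        return dims[1]
    return dims[2]


def cosine(a, b):
    """Dot product of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


full = [3.0, 4.0, 12.0]
small = truncate_embedding(full, 2)
```

Queries and documents would need to be truncated to the same width before comparison, so a real system would likely keep one index per supported dimension.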
Load-bearing premise
Jointly optimizing representations across transformer layers and multiple embedding dimensions during pretraining will remove the misalignment that arises when only part of a larger model is used downstream.
What would settle it
If smaller-dimension or shallower-layer versions of m3BERT fail to beat partially-initialized larger models on retrieval metrics such as recall or NDCG in the Bing-Click dataset, the claim that joint pretraining eliminates misalignment would not hold.
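For reference, the two metrics named above can be computed as in the sketch below. It uses the common textbook definitions (recall over the set of relevant documents, and DCG with a log2 rank discount); the exact variants used for Bing-Click are not stated here, so treat the function names and details as assumptions.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, gains, k):
    """DCG of the top k results divided by the DCG of an ideal ordering.

    gains: dict mapping a document id to its graded relevance.
    """
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7"]
recall = recall_at_k(ranked, {"d1", "d9"}, k=2)
ndcg = ndcg_at_k(ranked, {"d1": 1.0}, k=3)
```

Running the same functions on a truncated m3BERT variant and on a partially initialized baseline of equal size would give the comparison that the test above calls for.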
Original abstract
Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces m3BERT, a multilingual Matryoshka bidirectional encoder that jointly optimizes representations across transformer layers and multiple embedding dimensions during pretraining. It employs a three-stage pipeline (monolingual pretraining, multilingual adaptation, and continual pretraining on a massive web-domain corpus) and claims this eliminates misalignment from partial parameter initialization, enabling flexible resource-accuracy tradeoffs. The central empirical claim is significant outperformance over state-of-the-art embedding models on the large-scale Bing-Click industrial retrieval dataset, with supporting results on public datasets.
Significance. If the attribution of gains holds, the work would provide a practical, single-model solution for resource-constrained industrial retrieval systems by allowing consistent adaptation across embedding sizes without retraining from scratch. The three-stage web-scale pretraining addresses real deployment needs in commercial search and advertising. The approach builds on Matryoshka ideas but extends them to joint layer-dimension optimization in a multilingual setting.
major comments (2)
- [Experiments] Experiments section: The central claim that m3BERT significantly outperforms SOTA models on Bing-Click due to the multi-granular Matryoshka pretraining requires an ablation that holds the three-stage schedule, web corpus, and architectural updates fixed while removing only the joint optimization across layers and embedding dimensions. No such controlled ablation is described, leaving open the possibility that reported gains arise primarily from the additional continual pretraining on massive web data rather than the claimed innovation.
- [Abstract] Abstract and Experiments: The outperformance claim on Bing-Click provides no details on baselines, exact metrics (e.g., recall@K, NDCG), statistical significance tests, data splits, or preprocessing, which are load-bearing for assessing whether the results support the misalignment-resolution hypothesis.
minor comments (2)
- [Abstract] The abstract introduces 'Matryoshka' without a short parenthetical reference to prior work on Matryoshka embeddings, which would aid readers new to the concept.
- [Method] Notation for the joint loss across layers and dimensions could be clarified with an explicit equation in the method section to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and reporting clarity that we have addressed in the revision. We respond to each major comment below.
point-by-point responses
- Referee: [Experiments] Experiments section: The central claim that m3BERT significantly outperforms SOTA models on Bing-Click due to the multi-granular Matryoshka pretraining requires an ablation that holds the three-stage schedule, web corpus, and architectural updates fixed while removing only the joint optimization across layers and embedding dimensions. No such controlled ablation is described, leaving open the possibility that reported gains arise primarily from the additional continual pretraining on massive web data rather than the claimed innovation.
  Authors: We agree that a controlled ablation isolating the joint layer-and-dimension optimization is necessary to strengthen attribution of the observed gains. In the revised manuscript we have added this ablation (new Table 5 and accompanying text in Section 4.3). The experiment keeps the three-stage schedule, web corpus, and all architectural modifications identical while comparing the full joint Matryoshka objective against a variant that optimizes layers and embedding dimensions independently. The results show a consistent additional lift on Bing-Click from the joint optimization, supporting the claim that the multi-granular pretraining contributes beyond the continual web pretraining alone.
  revision: yes
- Referee: [Abstract] Abstract and Experiments: The outperformance claim on Bing-Click provides no details on baselines, exact metrics (e.g., recall@K, NDCG), statistical significance tests, data splits, or preprocessing, which are load-bearing for assessing whether the results support the misalignment-resolution hypothesis.
  Authors: We acknowledge that the original submission lacked sufficient experimental detail. The revised manuscript now includes an expanded description in both the abstract and Section 4.2: we list all baselines with citations, report recall@K and NDCG@K for multiple K, include paired t-test p-values for statistical significance, describe the train/validation/test splits of Bing-Click, and detail the preprocessing pipeline. These additions allow readers to evaluate the strength of the misalignment-resolution hypothesis directly from the reported numbers.
  revision: yes
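The paired t-test mentioned in the rebuttal compares two systems on the same queries. A minimal, dependency-free sketch of the test statistic follows; the function name and the toy scores are illustrative assumptions, and the rebuttal does not say how the authors computed their values.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for per-query score differences."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two queries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("differences are constant, so t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Toy per-query scores for two systems on the same three queries.
t_stat, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

The p-value would then be read from a Student t distribution with `dof` degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) returns both numbers directly.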
Circularity Check
No circularity: empirical pretraining evaluated on external data
full rationale
The paper introduces m3BERT via a three-stage pretraining pipeline (monolingual, multilingual adaptation, continual web pretraining) plus joint optimization across layers and embedding dimensions. All performance claims rest on direct empirical results against external industrial (Bing-Click) and public datasets rather than any mathematical derivation, fitted-parameter renaming, or self-citation chain that reduces the central claim to its own inputs. No equations appear that would allow a prediction to be recovered by construction from the training objective or prior self-work; the argument is therefore self-contained against independent benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Tag "unclear" describes the relation between the paper passage and the cited Recognition theorem.
  Paper passage: jointly optimizes representations across both transformer layers and multiple embedding dimensions... $\mathcal{L}_{\text{total}} = \sum_{l_i \in L} \sum_{d_j \in D} \mathcal{L}_{\text{MLM}}$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Tag "unclear" describes the relation between the paper passage and the cited Recognition theorem.
  Paper passage: three-stage pretraining: monolingual, multilingual, continual web-domain Inf-CL
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Implementation Details
A.1 Evaluation
As mentioned in Section 4.1, we conduct experiments on four benchmark datasets: BING-CLICK, MS MARCO Document Ranking, Natural Questions, and TREC-COVID. All datasets are formatte...