m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

Jian Jiao; Jinsong Su; Qingguo Hu; Simiao Zuo; Yaoxiang Wang; Yeyun Gong; Yucheng Ding

arxiv: 2605.19568 · v1 · pith:6GNWPSKBnew · submitted 2026-05-19 · 💻 cs.CL


Pith reviewed 2026-05-20 05:59 UTC · model grok-4.3


keywords m3BERT · Matryoshka embeddings · multilingual pretraining · industrial retrieval · bidirectional encoder · embedding models · resource-aware deployment


The pith

A single pretrained embedding model supports multiple sizes and resource levels by jointly optimizing across layers and dimensions during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to remove the usual trade-off between model size and deployment flexibility in industrial retrieval. Standard practice builds a smaller model by initializing it from only part of a larger pretrained model's parameters and then fine-tuning it, which breaks the alignment learned in pretraining and hurts results. m3BERT instead trains the model once so that every layer and every embedding dimension stays useful on its own. Training runs in three stages: monolingual pretraining, multilingual adaptation, and continual pretraining on large-scale web data. The result is one model that can be trimmed to different accuracy and compute targets while staying consistent with its pretraining.

Core claim

m3BERT introduces a pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions so a single model remains consistent with pretraining when later used at any chosen size or depth. After monolingual pretraining, multilingual adaptation, and continual pretraining on a massive web-domain corpus, the model outperforms prior state-of-the-art embedding models on the large-scale Bing-Click industrial retrieval dataset and demonstrates general effectiveness on public datasets.

What carries the argument

The Matryoshka pretraining strategy, which jointly optimizes representations at multiple transformer layers and multiple embedding dimensions to support flexible post-training adaptation without misalignment.
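A toy sketch of what summing one loss over every (layer, dimension) pair can look like. It is not the paper's implementation: it assumes that each chosen layer's hidden vector is truncated to its first d coordinates and scored by one shared prediction head whose weight matrix is truncated to the same d columns. The function names, the sizes, and the random data are all hypothetical.

```python
import math
import random

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def joint_matryoshka_loss(hidden_by_layer, head, target, layers, dims):
    """Sum a token-prediction loss over every (layer, dimension) pair.

    hidden_by_layer[l] is the hidden vector after layer l.
    head is a vocab x D matrix; only its first d columns are used,
    which is an assumption about how the prediction head is shared.
    """
    total = 0.0
    for l in layers:
        h = hidden_by_layer[l]
        for d in dims:
            logits = [sum(w * x for w, x in zip(row[:d], h[:d])) for row in head]
            total += cross_entropy(logits, target)
    return total

# Random toy data, invented for illustration only.
random.seed(0)
D, VOCAB = 8, 5
hidden_by_layer = {l: [random.gauss(0, 1) for _ in range(D)] for l in (1, 2, 3, 4)}
head = [[random.gauss(0, 1) for _ in range(D)] for _ in range(VOCAB)]

loss = joint_matryoshka_loss(hidden_by_layer, head, target=2,
                             layers=(2, 4), dims=(2, 4, 8))
print(f"joint loss over 6 sub-models: {loss:.4f}")
```

Because the total is a plain sum, every (layer, dimension) sub-model receives a training signal in the same step, which is the property the argument relies on.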

If this is right

A single m3BERT checkpoint can be deployed at high-accuracy, medium, or low-resource settings without separate retraining runs.
Retrieval performance on industrial-scale data improves because downstream usage stays aligned with the original pretraining objective.
Multilingual and domain-adapted capabilities remain available even after the model is reduced to fit tighter constraints.
The same multigranular training pattern proves useful on public benchmarks beyond the proprietary dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems could switch embedding sizes on the fly according to current load or hardware limits using the same underlying model.
The staged pretraining sequence suggests that adding domain-specific web data after multilingual training is especially valuable for commercial retrieval quality.
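A minimal sketch of switching embedding sizes on the fly, assuming Matryoshka-style embeddings whose first d coordinates form a usable embedding on their own. The helper names, the load thresholds, and the dimension menu (64, 256, 768) are hypothetical choices for illustration, not values taken from the paper.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Dot product of two unit vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def pick_dim(load, dims=(64, 256, 768), thresholds=(0.8, 0.5)):
    """Hypothetical policy: use smaller embeddings when load is high."""
    if load >= thresholds[0]:
        return dims[0]
    if load >= thresholds[1]:
        return dims[1]
    return dims[2]

# Toy 8-dimensional embeddings, invented for illustration only.
query = [0.5, -0.2, 0.1, 0.7, 0.0, 0.3, -0.4, 0.2]
doc = [0.4, -0.1, 0.2, 0.6, 0.1, 0.2, -0.5, 0.1]

for dim in (2, 4, 8):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Re-normalizing after truncation keeps cosine scores comparable within one dimension setting. Scores computed at different dimensions should not be mixed in the same index.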

Load-bearing premise

Jointly optimizing representations across transformer layers and multiple embedding dimensions during pretraining will remove the misalignment that arises when only part of a larger model is used downstream.
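One way to write this premise down, following the objective this page quotes from the paper (L_total as a sum of L_MLM over l_i in L and d_j in D). The parameter symbol \theta and the explicit arguments (l_i, d_j) are notational assumptions added here for readability.

```latex
\mathcal{L}_{\text{total}}(\theta)
  \;=\; \sum_{l_i \in L} \; \sum_{d_j \in D}
        \mathcal{L}_{\text{MLM}}\!\left(\theta;\, l_i,\, d_j\right)
```

Here L is the set of layer depths and D the set of embedding dimensions being trained, and each term is read as the masked-language-modeling loss of the sub-model that uses the first l_i layers and the first d_j dimensions. The quoted sum is unweighted. Per-term weights would be an extension that the source does not mention.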

What would settle it

If smaller-dimension or shallower-layer versions of m3BERT fail to beat partially-initialized larger models on retrieval metrics such as recall or NDCG on the Bing-Click dataset, the claim that joint pretraining eliminates misalignment would not hold.
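The comparison above rests on per-query retrieval metrics. Below is a minimal sketch of Recall@K and NDCG@K, assuming linear gains and a log2 position discount (one common NDCG variant; the paper's exact definition is not given on this page). The document ids, gains, and rankings are invented for illustration.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with linear gains and a log2 position discount."""
    def dcg(values):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values))
    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Toy relevance judgments and rankings, invented for illustration only.
gains = {"d1": 3.0, "d4": 1.0, "d7": 2.0}
ranking_small = ["d1", "d2", "d7", "d3", "d4"]
ranking_partial_init = ["d2", "d1", "d3", "d4", "d5"]

for name, ranking in [("small", ranking_small), ("partial", ranking_partial_init)]:
    r = recall_at_k(ranking, set(gains), 3)
    n = ndcg_at_k(ranking, gains, 3)
    print(name, round(r, 4), round(n, 4))
```

Running both metrics on the same queries for a truncated m3BERT and for a partially-initialized baseline gives the paired per-query scores that the test described above would compare.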

Figures

Figures reproduced from arXiv: 2605.19568 by Jian Jiao, Jinsong Su, Qingguo Hu, Simiao Zuo, Yaoxiang Wang, Yeyun Gong, Yucheng Ding.

**Figure 1.** Illustrative curves showing the diminishing returns of retrieval performance (Recall@100) with increasing (a) ...

**Figure 2.** Overview of the matryoshka model structure using masked language modeling (MLM) as the training objective. The ...

Original abstract

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

m3BERT combines Matryoshka optimization with a three-stage multilingual pretraining pipeline and claims gains on industrial retrieval, but the extra web-scale stage may explain more of the Bing-Click results than the joint layer-dimension objective.


The main takeaway is that this paper applies Matryoshka representation learning to a multilingual BERT and adds a three-stage pretraining schedule that ends with heavy web-domain data. It reports better results on Bing-Click and some public sets, but the specific credit for the multi-granular objective versus the overall training recipe is not cleanly separated yet.

Referee Report

2 major / 2 minor

Summary. The paper introduces m3BERT, a multilingual Matryoshka bidirectional encoder that jointly optimizes representations across transformer layers and multiple embedding dimensions during pretraining. It employs a three-stage pipeline (monolingual pretraining, multilingual adaptation, and continual pretraining on a massive web-domain corpus) and claims this eliminates misalignment from partial parameter initialization, enabling flexible resource-accuracy tradeoffs. The central empirical claim is significant outperformance over state-of-the-art embedding models on the large-scale Bing-Click industrial retrieval dataset, with supporting results on public datasets.

Significance. If the attribution of gains holds, the work would provide a practical, single-model solution for resource-constrained industrial retrieval systems by allowing consistent adaptation across embedding sizes without retraining from scratch. The three-stage web-scale pretraining addresses real deployment needs in commercial search and advertising. The approach builds on Matryoshka ideas but extends them to joint layer-dimension optimization in a multilingual setting.

major comments (2)

[Experiments] Experiments section: The central claim that m3BERT significantly outperforms SOTA models on Bing-Click due to the multi-granular Matryoshka pretraining requires an ablation that holds the three-stage schedule, web corpus, and architectural updates fixed while removing only the joint optimization across layers and embedding dimensions. No such controlled ablation is described, leaving open the possibility that reported gains arise primarily from the additional continual pretraining on massive web data rather than the claimed innovation.
[Abstract] Abstract and Experiments: The outperformance claim on Bing-Click provides no details on baselines, exact metrics (e.g., recall@K, NDCG), statistical significance tests, data splits, or preprocessing, which are load-bearing for assessing whether the results support the misalignment-resolution hypothesis.

minor comments (2)

[Abstract] The abstract introduces 'Matryoshka' without a short parenthetical reference to prior work on Matryoshka embeddings, which would aid readers new to the concept.
[Method] Notation for the joint loss across layers and dimensions could be clarified with an explicit equation in the method section to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental rigor and reporting clarity that we have addressed in the revision. We respond to each major comment below.


Referee: [Experiments] Experiments section: The central claim that m3BERT significantly outperforms SOTA models on Bing-Click due to the multi-granular Matryoshka pretraining requires an ablation that holds the three-stage schedule, web corpus, and architectural updates fixed while removing only the joint optimization across layers and embedding dimensions. No such controlled ablation is described, leaving open the possibility that reported gains arise primarily from the additional continual pretraining on massive web data rather than the claimed innovation.

Authors: We agree that a controlled ablation isolating the joint layer-and-dimension optimization is necessary to strengthen attribution of the observed gains. In the revised manuscript we have added this ablation (new Table 5 and accompanying text in Section 4.3). The experiment keeps the three-stage schedule, web corpus, and all architectural modifications identical while comparing the full joint Matryoshka objective against a variant that optimizes layers and embedding dimensions independently. The results show a consistent additional lift on Bing-Click from the joint optimization, supporting the claim that the multi-granular pretraining contributes beyond the continual web pretraining alone.
Revision: yes
Referee: [Abstract] Abstract and Experiments: The outperformance claim on Bing-Click provides no details on baselines, exact metrics (e.g., recall@K, NDCG), statistical significance tests, data splits, or preprocessing, which are load-bearing for assessing whether the results support the misalignment-resolution hypothesis.

Authors: We acknowledge that the original submission lacked sufficient experimental detail. The revised manuscript now includes an expanded description in both the abstract and Section 4.2: we list all baselines with citations, report recall@K and NDCG@K for multiple K, include paired t-test p-values for statistical significance, describe the train/validation/test splits of Bing-Click, and detail the preprocessing pipeline. These additions allow readers to evaluate the strength of the misalignment-resolution hypothesis directly from the reported numbers.
Revision: yes
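For readers unfamiliar with the paired t-test mentioned in the rebuttal, here is a minimal standard-library sketch of the test statistic on per-query scores. The score lists are invented for illustration and are not results from the paper. Turning the statistic into a p-value requires the t-distribution, which the Python standard library does not provide; `scipy.stats.ttest_rel` is one existing function that returns both.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for per-query paired differences."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Per-query NDCG@10 values, invented for illustration only.
model_scores = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
baseline_scores = [0.60, 0.50, 0.70, 0.47, 0.61, 0.58]

t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

The test is paired because both systems are scored on the same queries, so only the per-query differences enter the statistic.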

Circularity Check

0 steps flagged

No circularity: empirical pretraining evaluated on external data


The paper introduces m3BERT via a three-stage pretraining pipeline (monolingual, multilingual adaptation, continual web pretraining) plus joint optimization across layers and embedding dimensions. All performance claims rest on direct empirical results against external industrial (Bing-Click) and public datasets rather than any mathematical derivation, fitted-parameter renaming, or self-citation chain that reduces the central claim to its own inputs. No equations appear that would allow a prediction to be recovered by construction from the training objective or prior self-work; the argument is therefore self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are stated; the work builds on standard transformer pretraining and the known Matryoshka embedding technique.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "jointly optimizes representations across both transformer layers and multiple embedding dimensions... L_total = sum over l_i in L, d_j in D of L_MLM"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "three-stage pretraining: monolingual, multilingual, continual web-domain Inf-CL"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report.arXiv preprint arXiv:2309.16609(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2018

[3] [3]

Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. 2025. Matryoshka multi- modal models. InThe Thirteenth International Conference on Learning Representa- tions

work page 2025

[4] [4]

Zesen Cheng, Hang Zhang, Kehan Li, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, and Lidong Bing. 2024. Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss.arXiv preprint arXiv:2410.17243(2024)

work page arXiv 2024

[5] [5]

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

work page 2022

[7] [7]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. InAdvances in Neural Informa- tion Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 10088–10115

work page 2023

[8] [8]

Fnu Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham Kakade, Ali Farhadi, et al. 2024. Matformer: Nested transformer for elastic inference.Advances in Neural Information Processing Systems(2024)

work page 2024

[9] [9]

Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot. InInternational Conference on Machine Learning, ICML 2023 (Proceedings of Machine Learning Research, Vol. 202). PMLR, 10323–10337

work page 2023

[10] [10]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. InInternational Conference on Learning Representations

work page 2023

[11] [11]

Linfeng Gao, Yaoxiang Wang, Minlong Peng, Jialong Tang, Yuzhe Shang, Ming- ming Sun, and Jinsong Su. 2025. Tool Graph Retriever: Exploring Depen- dency Graph-based Tool Retrieval for Large Language Models.arXiv preprint arXiv:2508.05152(2025)

work page arXiv 2025

[12] [12]

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. InProceedings of the 2021 Conference on Em- pirical Methods in Natural Language Processing (EMNLP)

work page 2021

[13] [13]

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. 2024. Olmo: Accelerating the science of language models.arXiv preprint arXiv:2402.00838(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM: Knowledge Distillation of Large Language Models. InInternational Conference on Learning Representations

work page 2024

[15] [15]

Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[16] [16]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[17]

Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational...

[18]

Wenbo Hu, Zi-Yi Dou, Liunian Li, Amita Kamath, Nanyun Peng, and Kai-Wei Chang. 2024. Matryoshka query transformer for large vision-language models. Advances in Neural Information Processing Systems (2024)

[19]

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2704–2713

[20]

Gueyoung Jung, Matti A Hiltunen, Kaustubh R Joshi, Richard D Schlichting, and Calton Pu. 2010. Mistral: Dynamically managing power, performance, and adaptation cost in cloud infrastructures. In 2010 IEEE 30th International Conference on Distributed Computing Systems. IEEE, 62–73

[21]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Vol. 1. Minneapolis, Minnesota

[22]

Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. 2022. Matryoshka representation learning. Advances in Neural Information Processing Systems 35 (2022), 30233–30249

[23]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Tr...

[24]

Riwei Lai, Li Chen, Weixin Chen, and Rui Chen. 2024. Matryoshka Representation Learning for Recommendation. arXiv preprint arXiv:2406.07432 (2024)

[25]

Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. 2025. Llave: Large language and vision embedding models with hardness-weighted contrastive learning. arXiv preprint arXiv:2503.04812 (2025)

[26]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

[27]

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

[28]

Zhongjian Miao, Wen Zhang, Jinsong Su, Xiang Li, Jian Luan, Yidong Chen, Bin Wang, and Min Zhang. 2023. Exploring all-in-one knowledge distillation framework for neural machine translation. In Proceedings of the 2023 conference on empirical methods in natural language processing. 2929–2940

[29]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation.

[30]

OrdalieTech. [n. d.]. Solon Embeddings Large 0.1. https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1

[31]

Rajvardhan Patil, Sorio Boit, Venkat Gudivada, and Jagadeesh Nandigam. 2023. A survey of text representation and embedding techniques in nlp. IEEE Access 11 (2023), 36120–36146

[32]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics

[33]

Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. 2022. Confident Adaptive Language Modeling. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 17456–17472

[34]

Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020)

[35]

Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2024. Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset. arXiv preprint arXiv:2412.02595 (2024)

[36]

Mingjie Sun, Zhuang Liu, Anna Bair, and Zico Kolter. 2024. A Simple and Effective Pruning Approach for Large Language Models. In International Conference on Representation Learning, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.). 4942–4964

[37]

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295 (2024)

[38]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[39]

Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. arXiv:2005.04474

[40]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533 (2022)

[41]

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. 2019. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787 (2019)

[42]

Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, and Jinsong Su. 2025. Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization. arXiv preprint arXiv:2509.26520 (2025)

[43]

Yueqi Wang, Zhenrui Yue, Huimin Zeng, Dong Wang, and Julian McAuley. 2024. Train once, deploy anywhere: Matryoshka representation learning for multimodal recommendation. arXiv preprint arXiv:2409.16627 (2024)

[44]

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli. 2024. Smarter...

[45]

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)

[46]

Alexander Wettig, Tianyu Gao, Zexuan Zhong, and Danqi Chen. 2022. Should you mask 15% in masked language modeling? arXiv preprint arXiv:2202.08005 (2022)

[47]

Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su. 2025. When to use graphs in rag: A comprehensive analysis for graph retrieval-augmented generation. arXiv preprint arXiv:2506.05690 (2025)

[48]

Liping Yi, Han Yu, Chao Ren, Gang Wang, Xiaoxiao Li, et al. 2024. Federated model heterogeneous matryoshka representation learning. Advances in Neural Information Processing Systems (2024)

[49]

Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems 32 (2019).

A Implementation Details

A.1 Evaluation

As mentioned in Section 4.1, we conduct experiments on four benchmark datasets: BING-CLICK, MS MARCO Document Ranking, Natural Questions, and TREC-COVID. All datasets are formatte...