
arxiv: 2606.25674 · v1 · pith:U2HR7NIPnew · submitted 2026-06-24 · 💻 cs.CL · cs.IR

BitNet Text Embeddings

Pith reviewed 2026-06-25 20:46 UTC · model grok-4.3

keywords: text embeddings · quantization · BitNet · LLM · contrastive learning · distillation · low-bit models · retrieval

The pith

BITEMBED converts LLM backbones to ternary weights and quantized activations while recovering embedding quality through distillation and contrastive training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BITEMBED, a framework that converts pretrained LLM backbones into low-bit embedding encoders using ternary weights and quantized activations. It adapts these models via continual contrastive pre-training and then supervised contrastive fine-tuning, which includes similarity-distribution distillation and attention-relation distillation from a full-precision teacher. Output embeddings are further trained to support multiple storage precisions. The approach targets the high inference and storage costs of full-precision LLM embedders for retrieval and semantic tasks; experiments on MMTEB (eng, v2) with Qwen3-0.6B and Gemma3-270M show results largely comparable to the full-precision teachers.
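To make the contrastive stage concrete, here is a minimal sketch of an InfoNCE loss with in-batch negatives. The abstract does not state the exact objective, so the loss form, the `temperature` value, and the function names are assumptions chosen for illustration.

```python
import math

def normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, passages, temperature=0.05):
    """Mean InfoNCE loss; passages[i] is the positive for queries[i],
    and every other passage in the batch acts as a negative."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)
```

The loss is small when each query is closest to its own passage and grows when positives and negatives are confused.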

Core claim

BITEMBED converts pretrained LLM backbones into BitNet-style embedding encoders with ternary weights, quantized activations, and lightweight normalization refinement. The converted model is adapted to representation learning through continual contrastive pre-training, followed by supervised contrastive fine-tuning with both similarity-distribution distillation and attention-relation distillation from a full-precision teacher. Beyond quantizing the backbone, BITEMBED further trains output embeddings to support multiple storage precisions meeting different storage needs in various scenarios.
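The conversion step can be sketched as follows, assuming the absmean weight rule and per-vector absmax activation rule publicly described for BitNet b1.58. Whether BITEMBED uses exactly these rules is an assumption; the function names and the `eps` constant are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Absmean ternarization (assumed rule): divide by the mean absolute
    weight, round, and clip to {-1, 0, 1}. Returns (ternary matrix, scale)."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    q = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return q, scale

def quantize_activations(x, bits=8, eps=1e-8):
    """Per-vector absmax quantization to signed integers (assumed rule).
    Returns (integer codes, scale); code * scale approximates the input."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax + eps
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return q, scale
```

Multiplying the ternary matrix by its scale gives a coarse approximation of the original weights; the adaptation stages described above are what recover the quality lost in this step.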

What carries the argument

The BITEMBED framework: it converts LLM backbones to BitNet-style ternary weights and quantized activations, then recovers quality via contrastive pre-training and dual distillation from a full-precision teacher.
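One plausible reading of similarity-distribution distillation is a KL divergence between the teacher's and the student's softmax distributions over candidate passages for a query. The sketch below shows that reading; the direction of the KL, the `temperature` parameter, and the function names are assumptions, not details taken from the paper.

```python
import math

def softmax(scores, temperature=1.0):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def similarity_kl(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) for one query over its candidate passages.
    Inputs are raw query-passage similarity scores from each model."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student ranks and weights the candidates exactly as the teacher does, so it transfers the teacher's relative similarity structure rather than its raw embedding values.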

If this is right

  • BITEMBED achieves largely comparable performance to full-precision teacher embedders on MMTEB (eng, v2) with Qwen3-0.6B and Gemma3-270M.
  • The framework flexibly obtains text embeddings of various precisions to trade off performance against storage cost.
  • Quantizing the backbone to ternary weights and quantized activations reduces encoding inference costs while the adaptation steps preserve semantic quality.
  • Lightweight normalization refinement supports the backbone conversion without additional heavy components.
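The precision trade-off in the list above can be illustrated with plain post-hoc scalar quantization of an embedding at several bit widths. This is only a sketch of the storage formats: BITEMBED trains the output embeddings for multi-precision use, which this example does not do, and the uniform quantizer and function names are assumptions.

```python
import math

def quantize_embedding(vec, bits):
    """Uniformly quantize each coordinate to 2**bits levels over [-a, a],
    where a is the largest absolute coordinate. Returns dequantized values."""
    a = max(abs(v) for v in vec) or 1.0
    if bits == 1:
        # 1-bit storage keeps only the sign of each coordinate.
        return [a if v >= 0 else -a for v in vec]
    levels = 2 ** bits - 1
    step = 2 * a / levels
    return [-a + round((v + a) / step) * step for v in vec]

def cosine(u, v):
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den

# Hypothetical 64-dimensional embedding used only for illustration.
embedding = [math.sin(i) for i in range(1, 65)]
for bits in (1, 2, 4, 8):
    approx = quantize_embedding(embedding, bits)
    print(bits, round(cosine(embedding, approx), 4))
```

Lower bit widths keep less of the original direction, which is the performance side of the trade-off; the storage side is bits per coordinate.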

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quantization-plus-distillation pattern could extend to other dense retrieval or reranking models beyond the tested backbones.
  • Multiple-precision output training might allow dynamic switching of embedding storage in production indexes based on query load.
  • If the distillation losses generalize, similar techniques could reduce costs for other LLM downstream tasks that rely on internal representations.
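For the storage side of these extensions, the raw index size follows directly from vector count, dimension, and bits per coordinate. The sketch below does that arithmetic; the one-million-vector index and the 1024-dimensional embedding are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def index_bytes(num_vectors, dim, bits):
    """Raw bytes to store num_vectors embeddings of `dim` coordinates at
    `bits` bits per coordinate, padding each vector to a whole byte."""
    bytes_per_vector = (dim * bits + 7) // 8
    return num_vectors * bytes_per_vector

# Hypothetical index: 1,000,000 vectors of dimension 1024.
for bits in (16, 8, 4, 2, 1):
    print(bits, index_bytes(1_000_000, 1024, bits))
```

Under these assumed sizes, moving from 16-bit to 1-bit storage shrinks the raw vectors by a factor of 16, before any index metadata is counted.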

Load-bearing premise

Continual contrastive pre-training plus similarity-distribution and attention-relation distillation from a full-precision teacher can recover representation quality after converting the backbone to ternary weights and quantized activations.
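Attention-relation distillation is not specified in the abstract. A minimal sketch in the style of self-attention relation distillation is shown below, under the assumption that the loss compares token-to-token relation matrices between teacher and student; the function names and the choice of KL direction are hypothetical.

```python
import math

def relation_matrix(vectors):
    """Row-wise softmax of scaled dot products between token vectors
    (for example the query vectors of one attention head)."""
    d = len(vectors[0])
    rows = []
    for a in vectors:
        scores = [sum(x * y for x, y in zip(a, b)) / math.sqrt(d)
                  for b in vectors]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def relation_kl(teacher_vectors, student_vectors):
    """Mean per-token KL(teacher || student) between relation matrices."""
    t = relation_matrix(teacher_vectors)
    s = relation_matrix(student_vectors)
    total = 0.0
    for t_row, s_row in zip(t, s):
        total += sum(p * math.log(p / q) for p, q in zip(t_row, s_row) if p > 0)
    return total / len(t)
```

Because each relation matrix has one row and one column per token, the teacher and student vectors may have different hidden sizes, which is one reason relation-based losses are convenient for distillation.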

What would settle it

Running the same MMTEB (eng, v2) evaluation with Qwen3-0.6B or Gemma3-270M backbones and finding BITEMBED scores substantially below the full-precision teacher on retrieval metrics would falsify the comparability claim.

Figures

Figures reproduced from arXiv: 2606.25674 by Dongyan Zhao, Furu Wei, Huishuai Zhang, Liang Wang, Nan Yang, Shaohan Huang, Ting Song, Xin Huang, Xun Wu, Yan Xia, Zhen Li.

Figure 1: Performance-precision trade-off of BITEMBED on Qwen3-0.6B and Gemma3-270M. We report the average MMTEB (eng, v2) performance of 1-, 2-, 4-, 8-, and 16-bit output embeddings of BITEMBED
Figure 2: Task-type sensitivity on MMTEB (eng, v2). Columns 1, 2, 4, and 8 report the performance
Figure 3: Effect of the attention-relation distilla
Original abstract

LLM-based text embedders have substantially improved retrieval and semantic representation quality, but their deployment remains costly: large backbone models slow down embedding inference, while high-dimensional full-precision embeddings impose substantial storage and bandwidth overhead on large-scale indexes. In this paper, we present BITEMBED, an extreme low-bit framework for LLM-based text embedding that jointly targets encoding efficiency and vector storage. BITEMBED converts pretrained LLM backbones into BitNet-style embedding encoders with ternary weights, quantized activations, and lightweight normalization refinement. The converted model is adapted to representation learning through continual contrastive pre-training, followed by supervised contrastive fine-tuning with both similarity-distribution distillation and attention-relation distillation from a full-precision teacher. Beyond quantizing the backbone, BITEMBED further trains output embeddings to support multiple storage precisions meeting different storage needs in various scenarios. Experiments on MMTEB (eng, v2) with Qwen3-0.6B and Gemma3-270M show that BITEMBED is largely comparable to full precision teacher embedders. Moreover, BITEMBED flexibly obtains text embeddings of various precisions, achieving a trade-off between performance and storage cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BITEMBED, a framework for extreme low-bit LLM-based text embeddings. It converts pretrained LLM backbones (e.g., Qwen3-0.6B, Gemma3-270M) to BitNet-style encoders with ternary weights and quantized activations, applies continual contrastive pre-training followed by supervised contrastive fine-tuning with similarity-distribution and attention-relation distillation from a full-precision teacher, and trains output embeddings to support multiple storage precisions. Experiments on MMTEB (eng, v2) claim that the resulting models are largely comparable to full-precision teachers while enabling performance-storage trade-offs.

Significance. If the central performance claims hold after the described quantization and adaptation pipeline, the work would address practical deployment bottlenecks for embedding models by reducing both inference compute (via ternary weights) and index storage (via quantized embeddings). The multi-precision output training is a potentially useful extension beyond standard quantization. However, the absence of isolated ablation results or pre-adaptation baselines in the reported experiments limits the ability to quantify the contribution of the adaptation steps or to assess generalizability.
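The compute saving attributed to ternary weights can be seen in a toy matrix-vector product: with weights restricted to {-1, 0, 1} and one shared scale, the inner loop needs only additions and subtractions, plus a single multiplication per output. This is an illustrative sketch, not the kernel used in the paper, and the function name is hypothetical.

```python
def ternary_matvec(q_weights, scale, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by vector x.
    Each output accumulates with adds and subtracts only, then applies
    the shared weight scale once."""
    out = []
    for row in q_weights:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
            # w == 0 contributes nothing and is skipped
        out.append(acc * scale)
    return out

print(ternary_matvec([[1, 0, -1], [0, 1, 1]], 0.5, [2.0, 3.0, 4.0]))
```

Real speedups depend on packed weight storage and hardware kernels, which this pure-Python loop does not model.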

major comments (2)
  1. [Abstract] The headline claim that BITEMBED remains 'largely comparable' to the full-precision teacher after ternary-weight + quantized-activation conversion rests on the two-stage adaptation (continual contrastive pre-training plus distillation). No MMTEB scores are reported for the converted model immediately after BitNet-style conversion but before any adaptation, so the magnitude of initial degradation and the actual recovery achieved cannot be assessed.
  2. [Abstract, experiments paragraph] The statement that BITEMBED 'flexibly obtains text embeddings of various precisions' is presented without quantitative deltas, error bars, or dataset-split details on MMTEB (eng, v2). This prevents verification of whether the claimed trade-off between performance and storage cost is statistically meaningful or merely within noise of the teacher.
minor comments (1)
  1. [Abstract] The abstract mentions 'lightweight normalization refinement' without specifying the exact form or placement of this component relative to the ternary weights and quantized activations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and commit to revisions that directly respond to the concerns raised.

  1. Referee: [Abstract] The headline claim that BITEMBED remains 'largely comparable' to the full-precision teacher after ternary-weight + quantized-activation conversion rests on the two-stage adaptation (continual contrastive pre-training plus distillation). No MMTEB scores are reported for the converted model immediately after BitNet-style conversion but before any adaptation, so the magnitude of initial degradation and the actual recovery achieved cannot be assessed.

    Authors: We agree that reporting MMTEB performance for the model immediately after BitNet-style conversion (prior to the two-stage adaptation) would allow a clearer quantification of initial degradation and subsequent recovery. While the manuscript emphasizes end-to-end results for the full BITEMBED framework, we will add these pre-adaptation baseline scores on MMTEB (eng, v2) in the revised version to address this point directly.
    Revision: yes

  2. Referee: [Abstract, experiments paragraph] The statement that BITEMBED 'flexibly obtains text embeddings of various precisions' is presented without quantitative deltas, error bars, or dataset-split details on MMTEB (eng, v2). This prevents verification of whether the claimed trade-off between performance and storage cost is statistically meaningful or merely within noise of the teacher.

    Authors: We acknowledge that the current abstract lacks the requested quantitative details. In the revised manuscript we will expand the relevant section to report specific MMTEB scores across precisions, include error bars or standard deviations where available from our runs, and clarify the evaluation splits and protocol on MMTEB (eng, v2) so that the performance-storage trade-offs can be assessed rigorously.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs


The paper describes an empirical pipeline: BitNet-style conversion of LLM backbones followed by continual contrastive pre-training and distillation from an external full-precision teacher. No equations, fitted parameters, or self-citations are presented in the provided text that would make the reported MMTEB performance equivalent to the inputs by construction. The adaptation process is a standard training procedure whose outcome is measured against external benchmarks rather than derived tautologically. The central claims rest on experimental comparisons, not on renaming or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes that ternary quantization plus contrastive adaptation preserves semantic quality, but this is not formalized.

pith-pipeline@v0.9.1-grok · 5752 in / 1049 out tokens · 21788 ms · 2026-06-25T20:46:52.466770+00:00 · methodology


Reference graph

Works this paper leans on

79 extracted references · 4 canonical work pages

  1. [1]

    Semeval-2012 task 6: A pilot on semantic textual similarity

    Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: A pilot on semantic textual similarity. in* sem 2012: The first joint conference on lexical and computational semantics–volume 1: Proceedings of the main conference and the shared task, and volume 2: Proceedings of the sixth international workshop on semantic evaluation (...

  2. [2]

    Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

  3. [3]

    Revela: Dense retriever learning via language modeling

    Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, and Heinz Koeppl. Revela: Dense retriever learning via language modeling. arXiv preprint arXiv:2506.16552, 2025

  4. [4]

    Efficient intent detection with dual sentence encoders.arXiv preprint arXiv:2003.04807, 2020

    Iñigo Casanueva, Tadas Temˇcinas, Daniela Gerz, Matthew Henderson, and Ivan Vuli´c. Efficient intent detection with dual sentence encoders.arXiv preprint arXiv:2003.04807, 2020

  5. [5]

    Quartet: Native fp4 training can be optimal for large language models.Advances in Neural Information Processing Systems, 38:43552–43572, 2026

    Roberto Castro, Andrei Panferov, Rush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native fp4 training can be optimal for large language models.Advances in Neural Information Processing Systems, 38:43552–43572, 2026

  6. [6]

    Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation.arXiv preprint arXiv:1708.00055, 2017

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation.arXiv preprint arXiv:1708.00055, 2017

  7. [7]

    Open-domain question answering

    Danqi Chen and Wen-tau Yih. Open-domain question answering. InProceedings of the 58th annual meeting of the association for computational linguistics: tutorial abstracts, pages 34–37, 2020

  8. [8]

    mme5: Improving multimodal multilingual embeddings via high-quality synthetic data

    Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, and Zhicheng Dou. mme5: Improving multimodal multilingual embeddings via high-quality synthetic data. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8254–8275, 2025

  9. [9]

    Efficientqat: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficientqat: Efficient quantization-aware training for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081–10100, 2025

  10. [10]

    Semeval-2022 task 8: Multilingual news article similarity

    Xi Chen, Ali Zeynali, Chico Camargo, Fabian Flöck, Devin Gaffney, Przemyslaw Grabowicz, Scott A Hale, David Jurgens, and Mattia Samory. Semeval-2022 task 8: Multilingual news article similarity. InProceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1094–1106, 2022

  11. [11]

    Linq-embed-mistral technical report.arXiv preprint arXiv:2412.03223, 2024

    Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, and Jy-yong Sohn. Linq-embed-mistral technical report.arXiv preprint arXiv:2412.03223, 2024. 10

  12. [12]

    Specter: Document-level representation learning using citation-informed transformers.arXiv preprint arXiv:2004.07180, 2020

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. Specter: Document-level representation learning using citation-informed transformers.arXiv preprint arXiv:2004.07180, 2020

  13. [13]

    Quora question pairs.https://kaggle.com/competitions/quora-question-pairs, 2017

    DataCanary, hilfialkaff, Lili Jiang, Meg Risdal, Nikhil Dandekar, and tomtung. Quora question pairs.https://kaggle.com/competitions/quora-question-pairs, 2017. Kaggle

  14. [14]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35: 30318–30332, 2022

  15. [15]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu...

  16. [16]

    Bitdistiller: Unleashing the potential of sub-4-bit llms via self-distillation

    Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. Bitdistiller: Unleashing the potential of sub-4-bit llms via self-distillation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 102–116, 2024

  17. [17]

    Mmteb: Massive multilingual text embedding benchmark.arXiv preprint arXiv:2502.13595, 2025

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, et al. Mmteb: Massive multilingual text embedding benchmark.arXiv preprint arXiv:2502.13595, 2025

  18. [18]

    Eli5: Long form question answering.arXiv preprint arXiv:1907.09190, 2019

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. Eli5: Long form question answering.arXiv preprint arXiv:1907.09190, 2019

  19. [19]

    Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

  20. [20]

    Simcse: Simple contrastive learning of sentence embeddings.arXiv preprint arXiv:2104.08821, 2021

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings.arXiv preprint arXiv:2104.08821, 2021

  21. [21]

    A survey of low-bit large language models: Basics, systems, and algorithms.Neural networks, page 107856, 2025

    Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Yang Yong, Shiqiao Gu, Haotong Qin, Jinyang Guo, et al. A survey of low-bit large language models: Basics, systems, and algorithms.Neural networks, page 107856, 2025

  22. [22]

    Fei Huang, Fan Wu, Zeqing Zhang, Qihao Wang, Long Zhang, Grant Michael Boquet, and Hongyang Chen. Geogpt. rag technical report.arXiv preprint arXiv:2509.09686, 2025

  23. [23]

    Quaff: Quantized parameter-efficient fine-tuning under outlier spatial stability hypothesis

    Hong Huang and Dapeng Wu. Quaff: Quantized parameter-efficient fine-tuning under outlier spatial stability hypothesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6481–6496, 2025

  24. [24]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InEMNLP (1), pages 6769–6781, 2020

  25. [25]

    Colbert: Efficient and effective passage search via contextual- ized late interaction over bert

    Omar Khattab and Matei Zaharia. Colbert: Efficient and effective passage search via contextual- ized late interaction over bert. InProceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48, 2020

  26. [26]

    Ma- tryoshka representation learning.Advances in Neural Information Processing Systems, 35: 30233–30249, 2022

    Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Ma- tryoshka representation learning.Advances in Neural Information Processing Systems, 35: 30233–30249, 2022

  27. [27]

    Newsweeder: Learning to filter netnews

    Ken Lang. Newsweeder: Learning to filter netnews. InMachine learning proceedings 1995, pages 331–339. Elsevier, 1995. 11

  28. [28]

    Nv-embed: Improved techniques for training llms as generalist embedding models.arXiv preprint arXiv:2405.17428, 2024

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models.arXiv preprint arXiv:2405.17428, 2024

  29. [29]

    Gecko: Versatile text embeddings distilled from large language models.arXiv preprint arXiv:2403.20327, 2024

    Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, et al. Gecko: Versatile text embeddings distilled from large language models.arXiv preprint arXiv:2403.20327, 2024

  30. [30]

    Llama2vec: Unsupervised adaptation of large language models for dense retrieval

    Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao, and Defu Lian. Llama2vec: Unsupervised adaptation of large language models for dense retrieval. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3490–3500, 2024

  31. [31]

    Making text embedders few-shot learners.arXiv preprint arXiv:2409.15700, 2024

    Chaofan Li, MingHao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, and Zheng Liu. Making text embedders few-shot learners.arXiv preprint arXiv:2409.15700, 2024

  32. [32]

    Mtop: A comprehensive multilingual task-oriented semantic parsing benchmark.arXiv preprint arXiv:2008.09335, 2020

    Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. Mtop: A comprehensive multilingual task-oriented semantic parsing benchmark.arXiv preprint arXiv:2008.09335, 2020

  33. [33]

    Towards general text embeddings with multi-stage contrastive learning.arXiv preprint arXiv:2308.03281, 2023

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning.arXiv preprint arXiv:2308.03281, 2023

  34. [35]

    Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning.arXiv preprint arXiv:2501.03035, 2025

    Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. Quantization meets reasoning: Exploring llm low-bit quantization degradation for mathematical reasoning.arXiv preprint arXiv:2501.03035, 2025

  35. [36]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

  36. [37]

    Linkso: a dataset for learning to retrieve similar question answer pairs on software development forums

    Xueqing Liu, Chi Wang, Yue Leng, and ChengXiang Zhai. Linkso: a dataset for learning to retrieve similar question answer pairs on software development forums. InProceedings of the 4th ACM SIGSOFT International Workshop on NLP for Software Engineering, pages 2–5, 2018

  37. [38]

    Llm-qat: Data-free quantiza- tion aware training for large language models

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantiza- tion aware training for large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 467–484, 2024

  38. [39]

    The era of 1-bit llms: All large language models are in 1.58 bits.arXiv preprint arXiv:2402.17764, 2024

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are in 1.58 bits.arXiv preprint arXiv:2402.17764, 2024

  39. [40]

    Bitnet b1

    Shuming Ma, Hongyu Wang, Shaohan Huang, Xingxing Zhang, Ying Hu, Ting Song, Yan Xia, and Furu Wei. Bitnet b1. 58 2b4t technical report.arXiv preprint arXiv:2504.12285, 2025

  40. [41]

    Fine-tuning llama for multi-stage text retrieval

    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning llama for multi-stage text retrieval. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2421–2425, 2024

  41. [42]

    Learning word vectors for sentiment analysis

    Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011. 12

  42. [43]

    Tweet sentiment extraction

    Maggie, Phil Culliton, and Wei Chen. Tweet sentiment extraction. https://kaggle.com/ competitions/tweet-sentiment-extraction, 2020. Kaggle

  43. [44]

    Www’18 open challenge: financial opinion mining and question answering

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www’18 open challenge: financial opinion mining and question answering. InCompanion proceedings of the the web conference 2018, pages 1941– 1942, 2018

  44. [45]

    Hidden factors and hidden topics: understanding rating dimensions with review text

    Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. InProceedings of the 7th ACM conference on Recommender systems, pages 165–172, 2013

  45. [46]

    Sfrembedding-mistral: enhance text retrieval with transfer learning.Salesforce AI Research Blog, 3:6, 2024

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfrembedding-mistral: enhance text retrieval with transfer learning.Salesforce AI Research Blog, 3:6, 2024

  46. [47]

    Sgpt: Gpt sentence embeddings for semantic search.arXiv preprint arXiv:2202.08904, 2022

    Niklas Muennighoff. Sgpt: Gpt sentence embeddings for semantic search.arXiv preprint arXiv:2202.08904, 2022

  47. [48]

    MTEB : Massive Text Embedding Benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein, editors,Proceedings of the 17th Conference of the European Chapter of the Association for Computational Lin- guistics, pages 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics....

  48. [49]

    Generative representational instruction tuning

    Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. InInternational Conference on Learning Representations, volume 2025, pages 45544–45613, 2025

  49. [50]

    Matryoshka quantization.arXiv preprint arXiv:2502.06786, 2025

    Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, and Aditya Kusupati. Matryoshka quantization.arXiv preprint arXiv:2502.06786, 2025

  50. [51]

    Ms marco: A human-generated machine reading comprehension dataset

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human-generated machine reading comprehension dataset. 2016

  51. [52]

    I wish i would have loved this one, but i didn’t–a multilingual dataset for counterfactual detection in product reviews.arXiv preprint arXiv:2104.06893, 2021

    James O’Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. I wish i would have loved this one, but i didn’t–a multilingual dataset for counterfactual detection in product reviews.arXiv preprint arXiv:2104.06893, 2021

  52. [53]

    Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  53. [54]

    Sentence-bert: Sentence embeddings using siamese bert- networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

  54. [55]

    Carer: Contextualized affect representations for emotion recognition

    Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. Carer: Contextualized affect representations for emotion recognition. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 3687–3697, 2018

  55. [56]

    Q-rag: Long context multi-step retrieval via value-based embedder training.arXiv preprint arXiv:2511.07328, 2025

    Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, and Evgeny Burnaev. Q-rag: Long context multi-step retrieval via value-based embedder training.arXiv preprint arXiv:2511.07328, 2025

  56. [57]

    Automatic evaluate dialogue ap- propriateness by using dialogue act

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction- finetuned text embeddings. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1102–1121, Toronto...

  57. [58]

    Llms are also effective embedding models: An in-depth overview

    Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Kai Hua, Wenpeng Hu, Zhengwei Tao, and Shuai Ma. Llms are also effective embedding models: An in-depth overview. arXiv preprint arXiv:2412.12591, 2024

  58. [59]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

  59. [60]

    Fever: a large-scale dataset for fact extraction and verification.arXiv preprint arXiv:1803.05355, 2018

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification.arXiv preprint arXiv:1803.05355, 2018

  60. [61]

    Retrieval of the best counterargument without prior topic knowledge

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251, 2018

  61. [62]

    Bitnet: Scaling 1-bit transformers for large language models

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023

  62. [63]

    Text embeddings by weakly-supervised contrastive pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  63. [64]

    Improving text embeddings with large language models

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, Bangkok, Thailand, August 20...

  64. [65]

    Multilingual e5 text embeddings: A technical report

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672, 2024

  65. [66]

    Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020

  66. [67]

    Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, 2021

  67. [68]

    Yutong Wang, Haiyu Wang, and Sai Qian Zhang. Qsvd: Efficient low-rank approximation for unified query-key-value weight compression in low-precision vision-language models. Advances in Neural Information Processing Systems, 38:1789–1820, 2026

  68. [69]

    Bitnet distillation

    Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, and Furu Wei. Bitnet distillation. arXiv preprint arXiv:2510.13998, 2025

  69. [70]

    Smoothquant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023

  70. [71]

    Approximate nearest neighbor negative contrastive learning for dense text retrieval

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020

  71. [72]

    Onebit: Towards extremely low-bit large language models

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. Onebit: Towards extremely low-bit large language models. Advances in Neural Information Processing Systems, 37:66357–66382, 2024

  72. [73]

    Qwen3 technical report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  73. [74]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  74. [75]

    Rptq: Reorder-based post-training quantization for large language models

    Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, and Bingzhe Wu. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023

  75. [76]

    Qwen3 embedding: Advancing text embedding and reranking through foundation models

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176, 2025

  76. [77]

    Retrieval-augmented generation for ai-generated content: A survey

    Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

  77. [78]

    Dense text retrieval based on pretrained language models: A survey

    Wayne Xin Zhao, Jing Liu, Ruiyang Ren, and Ji-Rong Wen. Dense text retrieval based on pretrained language models: A survey. ACM Transactions on Information Systems, 42(4):1–60, 2024

  78. [79]

    Embedding in recommender systems: A survey

    Xiangyu Zhao, Maolin Wang, Xinjian Zhao, Jiansheng Li, Shucheng Zhou, Dawei Yin, Qing Li, Jiliang Tang, and Ruocheng Guo. Embedding in recommender systems: A survey. arXiv preprint arXiv:2310.18608, 2023

  79. [80]

    Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model

    Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, et al. Kalm-embedding-v2: Superior training techniques and data inspire a versatile embedding model. arXiv preprint arXiv:2506.20923, 2025.

A Training Data

Following BGE-en-ICL [31], our training data contains retrieval, reranking, ...