Rescaling MLM-Head for Neural Sparse Retrieval

Heuiseok Lim; Jonah Turner; Seongtae Hong; Youngjoon Jang

arxiv: 2606.18811 · v1 · pith:EHDYZX3Lnew · submitted 2026-06-17 · 💻 cs.IR · cs.AI

Pith reviewed 2026-06-26 19:33 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords: learned sparse retrieval · SPLADE · MLM head scale · neural information retrieval · sparse representations · pretrained encoders · training stability

The pith

A scale mismatch in the MLM head destabilizes SPLADE training with stronger encoders, resolved by rescaling the projection at initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stronger pretrained encoders often cause SPLADE to degrade or collapse during training because their MLM heads produce outputs at an inflated scale. SPLADE builds sparse lexical representations directly from these outputs and scores relevance with an unnormalized dot product, so the mismatch amplifies activations and distorts contrastive signals. A one-time multiplication of the MLM-head projection matrix by a fixed constant before training restores stability. The corrected models then reach or exceed the original BERT-SPLADE performance on both in-domain and out-of-domain benchmarks. This indicates that the barrier to using newer encoders in learned sparse retrieval is head-scale calibration rather than raw encoder capacity.

Core claim

Large MLM-head L2 norms create a scale mismatch when their outputs are used directly to build sparse lexical representations; the resulting inflated activations distort unnormalized dot-product scores and destabilize contrastive training. Rescaling the MLM-head projection by a constant factor at initialization corrects the mismatch, restores stable training, and yields competitive or superior retrieval effectiveness across benchmarks without any change to model architecture or training objective.
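A toy sketch can make the scale sensitivity concrete. It assumes the log-saturated, max-pooled activation commonly used in SPLADE-style models, and it mimics an inflated head by multiplying every logit by a constant, which corresponds to rescaling a bias-free projection. The logits, the factor 8.0, and all function names are illustrative assumptions, not values or code from the paper.

```python
import math

def splade_weights(token_logits):
    """Max-pool log(1 + ReLU(logit)) over tokens, one weight per vocabulary entry."""
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]

def dot_score(query_weights, doc_weights):
    """Unnormalized dot product between two lexical weight vectors."""
    return sum(q * d for q, d in zip(query_weights, doc_weights))

def scaled(token_logits, factor):
    """Multiply every logit by `factor` (stand-in for a rescaled, bias-free head)."""
    return [[factor * x for x in row] for row in token_logits]

# Toy logits with shape (tokens x vocabulary); purely illustrative numbers.
query_logits = [[2.0, -1.0, 0.5], [0.0, 3.0, -2.0]]
doc_logits = [[1.5, 0.5, -0.5], [2.5, -1.0, 1.0]]

def score_at_scale(factor):
    q = splade_weights(scaled(query_logits, factor))
    d = splade_weights(scaled(doc_logits, factor))
    return dot_score(q, d)

base_score = score_at_scale(1.0)      # head at its nominal scale
inflated_score = score_at_scale(8.0)  # hypothetical inflated head
```

Because no normalization is applied, the inflated score is strictly larger than the base score for the same query and document, which is the kind of amplification the core claim describes.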

What carries the argument

Constant-factor rescaling of the MLM-head projection matrix applied at initialization to align output scale with the requirements of unnormalized dot-product sparse matching.
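As a worked sketch of what a constant factor does, assume the head's final projection is linear with weight matrix $W$ and bias $b$, that $h_i$ is the hidden state of token $i$, and that only $W$ is multiplied by a constant $\alpha$. The symbols $\alpha$, $h_i$, and $b$ are notation introduced here for illustration; the abstract does not report the value of the factor or say whether the bias is touched.

```latex
W' = \alpha W, \qquad
z'_i = W' h_i + b = \alpha\, W h_i + b, \qquad
\lVert W' \rVert_2 = \lvert \alpha \rvert \, \lVert W \rVert_2 .
```

Under these assumptions the norm of the projection scales linearly with $\alpha$, and when the bias is small relative to $W h_i$ every logit shrinks by roughly the same factor for $0 < \alpha < 1$, without any change to architecture or loss.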

If this is right

ModernBERT and Ettin backbones reach stable training and competitive effectiveness after the correction.
The adjustment works for both in-domain and out-of-domain retrieval tasks.
Corrected models match or surpass the BERT-SPLADE baseline in several settings.
The performance bottleneck is MLM-head scale calibration, not encoder capacity alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same constant rescaling could stabilize other learned sparse retrieval methods that reuse pretrained MLM heads.
Scale calibration may be a general prerequisite when moving pretrained language models into sparse representation pipelines.
Testing whether the required factor varies systematically with backbone size or pretraining objective would clarify how widely the fix applies.

Load-bearing premise

The observed degradation and collapse are caused specifically by MLM-head scale mismatch rather than other interactions with the training recipe, data, or optimizer.

What would settle it

Train SPLADE on a standard benchmark with a large-norm backbone such as ModernBERT once with the rescaling applied and once without, then check whether training collapse occurs only in the unscaled run.

Figures

Figures reproduced from arXiv: 2606.18811 by Heuiseok Lim, Jonah Turner, Seongtae Hong, Youngjoon Jang.

**Figure 1.** MLM-head L2 norm ||W|| and BEIR-13 effectiveness of MS MARCO-trained SPLADE models for each backbone.

**Figure 2.** Effectiveness–sparsity trade-off on NanoBEIR under …

**Figure 4.** Training loss curves of ModernBERT- and Ettin …

Original abstract

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper spots a scale mismatch in the MLM head that blocks stronger backbones in SPLADE and offers a one-line rescaling fix at init, but the evidence does not fully isolate scale from other backbone differences.

The main point is that this paper diagnoses why some larger-norm encoders break SPLADE training and gives a zero-cost rescaling of the MLM-head weights at initialization to fix it. They report that the adjustment stabilizes contrastive training and lets ModernBERT and Ettin reach or beat the BERT-SPLADE baseline on both in-domain and out-of-domain benchmarks.

What the work does cleanly is name the interaction between MLM-head L2 norm and the unnormalized dot-product scoring inside SPLADE. That specific link is not in the earlier SPLADE papers they cite, and the proposed correction is simple enough that anyone swapping encoders can try it immediately. The abstract indicates the fix works across the tested backbones without changing architecture or loss.

The soft spot is that the central claim rests on summarized outcomes rather than detailed isolation. The stressed backbones differ from BERT in architecture, pretraining data, and optimization history, so the observed collapse could trace to any of those factors rather than scale alone. A single constant rescaling helps, but without targeted ablations or error bars it is hard to know how general the diagnosis is. The abstract also gives no full protocol or variance numbers, which leaves the strength of the empirical support unclear.

This is for people already running or extending learned sparse retrieval who want to test newer encoders. It removes a practical obstacle but does not change the core modeling approach. The thinking is straightforward and the claim is falsifiable, so the paper deserves a serious referee even if the experiments need more controls.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that SPLADE training with backbones possessing large MLM-head L2 norms (e.g., ModernBERT, Ettin) suffers performance degradation or collapse because the unnormalized dot-product relevance score amplifies sparse activations; a constant rescaling factor applied to the MLM-head projection at initialization corrects the mismatch, stabilizes contrastive training, and yields competitive or superior results on in- and out-of-domain retrieval benchmarks relative to the BERT-SPLADE baseline.

Significance. If the central empirical observation holds after proper isolation, the result is significant for learned sparse retrieval: it supplies a zero-cost, architecture-preserving initialization adjustment that enables stronger pretrained encoders to be used in LSR pipelines. The work correctly highlights that MLM-head scale, rather than encoder capacity alone, is a practical bottleneck when the sparse representation is taken directly from the MLM output.

major comments (2)

[§4 (Experiments)] The attribution of failure specifically to MLM-head scale mismatch is not isolated from other backbone differences. ModernBERT and Ettin differ from BERT in architecture, pretraining corpus, and optimization history; the experiments compare full backbones without controls (e.g., norm-matched variants or synthetic rescaling of BERT) that would test whether the observed collapse is caused by the reported L2-norm inflation rather than correlated factors.
[§3 (Method)] No ablation or sensitivity analysis is reported for the choice or magnitude of the rescaling constant itself; the manuscript states that a single constant suffices across backbones, yet provides neither the value used nor evidence that performance is robust to small perturbations of that constant.
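The norm-matched control raised in the first major comment could be set up roughly as follows: compute the factor that brings a small-norm head up to a chosen target norm, then apply it before training. This is a minimal sketch that assumes the Frobenius norm as the "L2 norm" of the matrix; the toy matrix, the target value 5.0, and the function names are hypothetical and do not come from the paper.

```python
import math

def frobenius_norm(matrix):
    """Square root of the sum of squared entries (assumed definition of the L2 norm)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def norm_matching_factor(matrix, target_norm):
    """Constant that rescales `matrix` so its Frobenius norm equals `target_norm`."""
    current = frobenius_norm(matrix)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero matrix")
    return target_norm / current

def rescale(matrix, factor):
    """Return a copy of `matrix` with every entry multiplied by `factor`."""
    return [[factor * x for x in row] for row in matrix]

# Toy stand-in for a small-norm MLM-head projection (not real weights).
small_head = [[0.3, -0.4], [0.0, 1.2]]
target_norm = 5.0  # hypothetical norm measured on a large-norm backbone

factor = norm_matching_factor(small_head, target_norm)
inflated_head = rescale(small_head, factor)
```

Training from `inflated_head` while keeping the encoder, data, and optimizer fixed would isolate head scale from the other backbone differences the referee lists.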

minor comments (2)

Results are presented as single-point estimates without error bars, standard deviations, or the number of random seeds; this makes it impossible to judge whether reported gains are statistically reliable.
The exact value of the rescaling factor and the precise initialization procedure (which layers, which parameters) should be stated explicitly rather than described only as 'a constant factor'.
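The reporting requested in the first minor comment amounts to summarizing each configuration over several seeds. A minimal sketch, using Python's standard `statistics` module; the three scores are placeholders chosen only so the code runs and are not results from the paper.

```python
import statistics

def summarize(scores):
    """Number of seeds, mean, and sample standard deviation for one configuration."""
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": mean, "std": std}

# Placeholder per-seed metric values (NOT taken from the paper).
placeholder_scores = [0.40, 0.42, 0.44]
summary = summarize(placeholder_scores)
```

Reporting `n`, `mean`, and `std` next to each table entry would let readers judge whether a gap between two backbones exceeds seed-to-seed variation.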

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments highlight important gaps in experimental controls and reporting. We address each below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [§4 (Experiments)] The attribution of failure specifically to MLM-head scale mismatch is not isolated from other backbone differences. ModernBERT and Ettin differ from BERT in architecture, pretraining corpus, and optimization history; the experiments compare full backbones without controls (e.g., norm-matched variants or synthetic rescaling of BERT) that would test whether the observed collapse is caused by the reported L2-norm inflation rather than correlated factors.

Authors: We agree that the current experiments do not fully isolate the MLM-head scale from other backbone differences. The manuscript demonstrates that the rescaling correction stabilizes training and improves results for large-norm backbones, but lacks the suggested controls. In the revised version we will add a controlled experiment that applies synthetic rescaling to the BERT MLM-head to match the larger norms observed in ModernBERT/Ettin and shows the resulting training instability, thereby providing direct evidence that the scale mismatch is the causal factor. revision: yes

Referee: [§3 (Method)] No ablation or sensitivity analysis is reported for the choice or magnitude of the rescaling constant itself; the manuscript states that a single constant suffices across backbones, yet provides neither the value used nor evidence that performance is robust to small perturbations of that constant.

Authors: The referee correctly notes the missing details. The manuscript will be revised to explicitly state the rescaling constant employed in all reported runs and to include a sensitivity plot (or table) showing retrieval metrics for a range of constants around the chosen value. This will confirm that performance remains stable for modest perturbations and that a single constant works across the tested backbones. revision: yes
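A sensitivity sweep of the kind promised in the second response could be organized as sketched here. The geometric grid, the reference factor 0.5, and the `placeholder_run` function are assumptions made so the sketch executes; a real study would replace `placeholder_run` with a full SPLADE training and evaluation job.

```python
def geometric_grid(center, ratio, steps):
    """Factors center * ratio**k for k = -steps..steps (assumed sweep design)."""
    return [center * ratio ** k for k in range(-steps, steps + 1)]

def sweep(factors, train_and_evaluate):
    """Call the user-supplied training routine once per rescaling factor."""
    return {factor: train_and_evaluate(factor) for factor in factors}

def max_deviation(results, reference_factor):
    """Largest absolute metric gap relative to the reference factor."""
    reference = results[reference_factor]
    return max(abs(metric - reference) for metric in results.values())

def placeholder_run(factor):
    # Stand-in for a full training + evaluation job. It returns a made-up
    # number so the sketch runs; it is NOT a result from the paper.
    return 1.0 - (factor - 0.5) ** 2

factors = geometric_grid(center=0.5, ratio=2.0, steps=1)
results = sweep(factors, placeholder_run)
gap = max_deviation(results, reference_factor=0.5)
```

A small `gap` across the grid would support the claim that a single constant is robust; a large one would indicate the factor needs per-backbone tuning.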

Circularity Check

0 steps flagged

No circularity; empirical diagnosis and fix are independent of self-reference

The paper reports observed training collapse with high-norm MLM heads under SPLADE's unnormalized dot-product matching, measures L2 norms, and applies a constant initialization rescaling. No equations, predictions, or uniqueness claims reduce by construction to fitted quantities or self-citations. The central claim rests on external training runs and benchmark results rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about how SPLADE constructs and scores sparse vectors; the rescaling factor is introduced as a practical correction without new postulated entities.

free parameters (1)

rescaling constant
A fixed multiplier applied to the MLM-head projection at initialization; its specific value is not reported in the abstract.

axioms (1)

domain assumption: SPLADE computes relevance via an unnormalized dot product over MLM-head outputs
Explicitly stated in the abstract as the source of sensitivity to head scale.
