
arxiv: 2606.18811 · v1 · pith:EHDYZX3Lnew · submitted 2026-06-17 · 💻 cs.IR · cs.AI

Rescaling MLM-Head for Neural Sparse Retrieval

Pith reviewed 2026-06-26 19:33 UTC · model grok-4.3

keywords learned sparse retrieval · SPLADE · MLM head scale · neural information retrieval · sparse representations · pretrained encoders · training stability

The pith

A scale mismatch in the MLM head destabilizes SPLADE training with stronger encoders, resolved by rescaling the projection at initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stronger pretrained encoders often cause SPLADE to degrade or collapse during training because their MLM heads produce outputs at an inflated scale. SPLADE builds sparse lexical representations directly from these outputs and scores relevance with an unnormalized dot product, so the mismatch amplifies activations and distorts contrastive signals. A one-time multiplication of the MLM-head projection matrix by a fixed constant before training restores stability. The corrected models then reach or exceed the original BERT-SPLADE performance on both in-domain and out-of-domain benchmarks. This indicates that the barrier to using newer encoders in learned sparse retrieval is head-scale calibration rather than raw encoder capacity.
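
The amplification described above can be checked on toy numbers. The sketch below assumes the common SPLADE activation (max pooling of log(1 + ReLU(logit)) over tokens) and uses random logits in place of real MLM-head outputs; the helper names, the toy vocabulary size, and the 10x factor are illustrative assumptions, not values from the paper.

```python
import numpy as np

def splade_rep(logits):
    """Max-pool log(1 + ReLU(logit)) over tokens: one weight per vocab term."""
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

def score(q_logits, d_logits):
    """Unnormalized dot product between the sparse query and document vectors."""
    return float(splade_rep(q_logits) @ splade_rep(d_logits))

rng = np.random.default_rng(0)
q_logits = rng.normal(size=(4, 50))    # 4 query tokens, toy vocabulary of 50 terms
d_logits = rng.normal(size=(12, 50))   # 12 document tokens

base = score(q_logits, d_logits)
# Hypothetical 10x larger head scale applied to both sides of the match.
inflated = score(10.0 * q_logits, 10.0 * d_logits)

print(f"base score: {base:.2f}, inflated score: {inflated:.2f}")
```

Because every activation is non-negative and grows with the logit scale, the inflated score can never fall below the base score, and no normalization step brings it back down.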

Core claim

Large MLM-head L2 norms create a scale mismatch when their outputs are used directly to build sparse lexical representations; the resulting inflated activations distort unnormalized dot-product scores and destabilize contrastive training. Rescaling the MLM-head projection by a constant factor at initialization corrects the mismatch, restores stable training, and yields competitive or superior retrieval effectiveness across benchmarks without any change to model architecture or training objective.
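
One way to write the scale argument down, assuming the standard max-pooling SPLADE formulation and a head of the form z = Wh + b. The notation here is ours and is not taken from the paper.

```latex
\begin{aligned}
z_{ij} &= W_j^{\top} h_i + b_j, \\
w_j &= \max_i \, \log\bigl(1 + \mathrm{ReLU}(z_{ij})\bigr), \\
s(q, d) &= \sum_{j} w_j^{(q)} \, w_j^{(d)} .
\end{aligned}
```

Replacing W with αW (and taking b = 0 for simplicity) turns each activation into log(1 + α ReLU(z_ij)), which is increasing in α: roughly linear in α for small logits and shifted by about log α for large ones. Since s(q, d) is not normalized, the factor enters through both the query and the document vector and nothing cancels it, unlike a cosine similarity, which is invariant to such scaling.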

What carries the argument

Constant-factor rescaling of the MLM-head projection matrix applied at initialization to align output scale with the requirements of unnormalized dot-product sparse matching.
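
A minimal sketch of such a one-time rescaling, using a random matrix as a stand-in for the head projection. The names `rescale_head`, `W`, `h` and the value `alpha = 0.1` are hypothetical; neither the constant nor the exact parameter being rescaled is given in the material reviewed here.

```python
import numpy as np

def rescale_head(weight, alpha):
    """Return a copy of the head projection multiplied by a constant factor."""
    return alpha * weight

rng = np.random.default_rng(0)
hidden, vocab = 8, 20
W = rng.normal(scale=5.0, size=(vocab, hidden))  # toy "large-norm" projection
h = rng.normal(size=hidden)                      # one token's hidden state

alpha = 0.1  # illustrative value only, not the paper's constant
W_scaled = rescale_head(W, alpha)

norm_before = np.linalg.norm(W)         # Frobenius norm of the original matrix
norm_after = np.linalg.norm(W_scaled)   # Frobenius norm after rescaling
logits_before = W @ h
logits_after = W_scaled @ h

print(f"norm: {norm_before:.2f} -> {norm_after:.2f}")
```

Both the matrix norm and the bias-free logits shrink by exactly `alpha`, and the step runs once before training, which is consistent with the description of the correction as zero-cost.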

If this is right

  • ModernBERT and Ettin backbones reach stable training and competitive effectiveness after the correction.
  • The adjustment works for both in-domain and out-of-domain retrieval tasks.
  • Corrected models match or surpass the BERT-SPLADE baseline in several settings.
  • The performance bottleneck is MLM-head scale calibration, not encoder capacity alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constant rescaling could stabilize other learned sparse retrieval methods that reuse pretrained MLM heads.
  • Scale calibration may be a general prerequisite when moving pretrained language models into sparse representation pipelines.
  • Testing whether the required factor varies systematically with backbone size or pretraining objective would clarify how widely the fix applies.

Load-bearing premise

The observed degradation and collapse are caused specifically by MLM-head scale mismatch rather than other interactions with the training recipe, data, or optimizer.

What would settle it

Train SPLADE on a standard benchmark with a large-norm backbone such as ModernBERT once with the rescaling applied and once without, then check whether training collapse occurs only in the unscaled run.

Figures

Figures reproduced from arXiv: 2606.18811 by Heuiseok Lim, Jonah Turner, Seongtae Hong, Youngjoon Jang.

Figure 1: MLM-head L2 norm ‖W‖ and BEIR-13 effectiveness of MS MARCO-trained SPLADE models for each backbone.
Figure 2: Effectiveness–sparsity trade-off on NanoBEIR under …
Figure 4: Training loss curves of ModernBERT- and Ettin …
Original abstract

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that SPLADE training with backbones possessing large MLM-head L2 norms (e.g., ModernBERT, Ettin) suffers performance degradation or collapse because the unnormalized dot-product relevance score amplifies sparse activations; a constant rescaling factor applied to the MLM-head projection at initialization corrects the mismatch, stabilizes contrastive training, and yields competitive or superior results on in- and out-of-domain retrieval benchmarks relative to the BERT-SPLADE baseline.

Significance. If the central empirical observation holds after proper isolation, the result is significant for learned sparse retrieval: it supplies a zero-cost, architecture-preserving initialization adjustment that enables stronger pretrained encoders to be used in LSR pipelines. The work correctly highlights that MLM-head scale, rather than encoder capacity alone, is a practical bottleneck when the sparse representation is taken directly from the MLM output.

major comments (2)
  1. [§4 (Experiments)] The attribution of failure specifically to MLM-head scale mismatch is not isolated from other backbone differences. ModernBERT and Ettin differ from BERT in architecture, pretraining corpus, and optimization history; the experiments compare full backbones without controls (e.g., norm-matched variants or synthetic rescaling of BERT) that would test whether the observed collapse is caused by the reported L2-norm inflation rather than correlated factors.
  2. [§3 (Method)] No ablation or sensitivity analysis is reported for the choice or magnitude of the rescaling constant itself; the manuscript states that a single constant suffices across backbones, yet provides neither the value used nor evidence that performance is robust to small perturbations of that constant.
minor comments (2)
  1. Results are presented as single-point estimates without error bars, standard deviations, or the number of random seeds; this makes it impossible to judge whether reported gains are statistically reliable.
  2. The exact value of the rescaling factor and the precise initialization procedure (which layers, which parameters) should be stated explicitly rather than described only as 'a constant factor'.
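
The norm-matched control requested in the first major comment could be set up roughly as follows. This is a sketch under assumptions: the "L2 norm" of the head is taken to be the Frobenius norm of the projection matrix, random matrices stand in for the BERT and large-norm heads, and `match_norm` is a hypothetical helper.

```python
import numpy as np

def match_norm(weight, target_norm):
    """Scale `weight` so that its Frobenius norm equals `target_norm`."""
    current = np.linalg.norm(weight)
    if current == 0.0:
        raise ValueError("cannot rescale an all-zero matrix")
    factor = target_norm / current
    return factor * weight, factor

rng = np.random.default_rng(0)
small_head = rng.normal(scale=0.5, size=(20, 8))  # stand-in for a small-norm head
large_head = rng.normal(scale=5.0, size=(20, 8))  # stand-in for a large-norm head

# Inflate the small head to the large head's norm, leaving everything else fixed.
inflated_head, factor = match_norm(small_head, np.linalg.norm(large_head))

print(f"inflation factor: {factor:.2f}")
```

Training the same backbone with the original and the inflated head would vary only the head scale, which is the isolation the referee asks for.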

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments highlight important gaps in experimental controls and reporting. We address each below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [§4 (Experiments)] The attribution of failure specifically to MLM-head scale mismatch is not isolated from other backbone differences. ModernBERT and Ettin differ from BERT in architecture, pretraining corpus, and optimization history; the experiments compare full backbones without controls (e.g., norm-matched variants or synthetic rescaling of BERT) that would test whether the observed collapse is caused by the reported L2-norm inflation rather than correlated factors.

    Authors: We agree that the current experiments do not fully isolate the MLM-head scale from other backbone differences. The manuscript demonstrates that the rescaling correction stabilizes training and improves results for large-norm backbones, but lacks the suggested controls. In the revised version we will add a controlled experiment that applies synthetic rescaling to the BERT MLM-head to match the larger norms observed in ModernBERT/Ettin and shows the resulting training instability, thereby providing direct evidence that the scale mismatch is the causal factor. revision: yes

  2. Referee: [§3 (Method)] No ablation or sensitivity analysis is reported for the choice or magnitude of the rescaling constant itself; the manuscript states that a single constant suffices across backbones, yet provides neither the value used nor evidence that performance is robust to small perturbations of that constant.

    Authors: The referee correctly notes the missing details. The manuscript will be revised to explicitly state the rescaling constant employed in all reported runs and to include a sensitivity plot (or table) showing retrieval metrics for a range of constants around the chosen value. This will confirm that performance remains stable for modest perturbations and that a single constant works across the tested backbones. revision: yes
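
The sensitivity analysis promised in the second response might be organized as a log-spaced sweep around the chosen constant. In the sketch below, `sensitivity_sweep` is a hypothetical harness and `toy_metric` is a placeholder function, not a retrieval result; a real study would replace it with a full train-and-evaluate run per constant.

```python
import numpy as np

def sensitivity_sweep(center, train_and_eval, span=4.0, num=5):
    """Evaluate a log-spaced grid of constants from center/span to center*span."""
    grid = np.geomspace(center / span, center * span, num=num)
    scores = np.array([train_and_eval(float(c)) for c in grid])
    return grid, scores

def toy_metric(c):
    # Placeholder, NOT a retrieval metric: a smooth bump peaking at c = 0.1.
    return float(np.exp(-np.log(c / 0.1) ** 2))

grid, scores = sensitivity_sweep(center=0.1, train_and_eval=toy_metric)
worst_drop = float(scores.max() - scores.min())

for c, s in zip(grid, scores):
    print(f"constant={c:.3f}  metric={s:.3f}")
```

A log-spaced grid is a natural choice here because the quantity being varied is a multiplicative factor, so halving and doubling the constant are treated symmetrically.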

Circularity Check

0 steps flagged

No circularity; empirical diagnosis and fix are independent of self-reference

Full rationale

The paper reports observed training collapse with high-norm MLM heads under SPLADE's unnormalized dot-product matching, measures L2 norms, and applies a constant initialization rescaling. No equations, predictions, or uniqueness claims reduce by construction to fitted quantities or self-citations. The central claim rests on external training runs and benchmark results rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about how SPLADE constructs and scores sparse vectors; the rescaling factor is introduced as a practical correction without new postulated entities.

free parameters (1)
  • rescaling constant
    A fixed multiplier applied to the MLM-head projection at initialization; its specific value is not reported in the abstract.
axioms (1)
  • domain assumption: SPLADE computes relevance via unnormalized dot product over MLM-head outputs
    Explicitly stated in the abstract as the source of sensitivity to head scale.


