arxiv: 2605.16608 · v2 · pith:XRQACSIDnew · submitted 2026-05-15 · 💻 cs.LG · cs.CL

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios

Pith reviewed 2026-05-20 20:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords text embeddings · truncation robustness · Matryoshka Representation Learning · MRL · embedding reduction · downstream tasks · model training

The pith

Text embeddings from standard models stay competitive when truncated unless reduced by 80 percent or more, so MRL training is often unnecessary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether Matryoshka Representation Learning is required for text embeddings to remain useful after truncation to smaller sizes. It runs the same truncation schedule on both MRL-trained models and ordinary models across multiple encoders and tasks. Results show non-MRL embeddings perform as well as or better than MRL ones until truncation reaches at least 80 percent size reduction. A reader would care because this implies the extra training cost of MRL may not be justified unless very small vectors are required.

Core claim

By applying identical truncation schedules from MRL training to models trained with and without MRL, the experiments demonstrate that non-MRL embeddings are competitive with and frequently outperform MRL embeddings on downstream tasks when size reduction stays below 80 percent, indicating that truncation robustness arises from standard embedding training rather than from the MRL procedure itself.

What carries the argument

Identical truncation schedule taken from MRL training and applied to both MRL and non-MRL text embedding vectors.
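
As a rough illustration of what a prefix truncation schedule involves, the sketch below keeps the first `size` dimensions of a vector and reports the resulting size reduction. It is a minimal example under stated assumptions: the function names `truncate_prefix` and `reduction`, the 768-dimensional encoder, and the schedule sizes are all hypothetical, and re-normalizing after the cut is an assumption rather than something this page confirms about the paper's setup.

```python
import math

def truncate_prefix(vec, size):
    """Keep the first `size` dimensions of `vec` and L2-normalize the result."""
    if not 0 < size <= len(vec):
        raise ValueError("size must be between 1 and len(vec)")
    prefix = vec[:size]
    norm = math.sqrt(sum(x * x for x in prefix))
    # Leave an all-zero prefix untouched to avoid dividing by zero.
    return prefix if norm == 0 else [x / norm for x in prefix]

def reduction(full_size, size):
    """Fraction of dimensions removed, e.g. 0.75 for a 75 percent reduction."""
    return 1 - size / full_size

# Hypothetical schedule for a 768-dimensional encoder (sizes are illustrative).
schedule = [768, 512, 256, 128, 64]
embedding = [float(i % 7 - 3) for i in range(768)]  # stand-in for a real embedding

for size in schedule:
    cut = truncate_prefix(embedding, size)
    print(size, len(cut), f"{reduction(768, size):.0%}")
```

Under these illustrative sizes, a cut from 768 to 256 dimensions removes about two thirds of the vector and stays below the 80 percent threshold, while a cut to 128 dimensions removes more than 80 percent and falls in the heavy-truncation regime.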

If this is right

  • Standard embedding training suffices for most truncation levels without added MRL cost.
  • MRL training becomes relevant only when applications demand very heavy truncation.
  • Truncation robustness appears to be a general property of text embeddings rather than something MRL must instill.
  • Model selection can prioritize standard objectives when moderate-sized vectors meet needs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployers of embedding systems could save training compute by skipping MRL unless extreme size reduction is planned.
  • The result invites similar tests on image or multimodal embeddings to check if robustness is modality-specific.
  • Practitioners might experiment with even simpler truncation methods on existing models to confirm the pattern holds.

Load-bearing premise

That applying the truncation sizes and method chosen for MRL creates a fair test of whether MRL training itself is needed for robustness.

What would settle it

Finding that non-MRL models underperform MRL models by a large margin at truncation levels below 80 percent reduction on the same tasks would disprove the central result.

Figures

Figures reproduced from arXiv: 2605.16608 by Daniel Ruffinelli, Simone Paolo Ponzetto, Sotaro Takeshita, Yurina Takeshita.

Figure 1: (Top) Robustness of open text encoders as truncation levels increase looks the same whether trained with or without MRL. (Bottom) When models differ only in their use of MRL, truncation on non-MRL models is superior unless heavy truncation is applied.
Figure 2: Performance on NanoBEIR (top) and MTEB (bottom) of text embeddings truncated at various sizes, relative to the performance of the corresponding full-size embeddings.
Figure 4: Standard deviation across embedding dimen…
Figure 5: Validation loss curve for contrastive learning with and without MRL for all model pairs. Our training…
Figure 6: Absolute performance on NanoBEIR (top) and MTEB (bottom) of text embeddings by smaller models…
Figure 7: Absolute performance on NanoBEIR (top) and MTEB (bottom) of text embeddings by larger models…
Figure 8: Performance on BEIR and MTEB benchmarks of five pairs of encoders trained with and without MRL.
Figure 9: Standard deviations of values taken by each dimension when encoding different texts. We observe that…
Figure 10: Performance of smaller open text encoders in NanoBEIR datasets.
Figure 11: Performance of larger open text encoders in NanoBEIR datasets.
Figure 12: Performance of smaller open text encoders in MTEB datasets.
Figure 13: Performance of larger open text encoders in MTEB datasets.
Figure 14: BERT base performance on each of the BEIR datasets.
Figure 15: BERT large performance on each of the NanoBEIR datasets.
Figure 16: RoBERTa base performance on each of the BEIR datasets.
Figure 17: RoBERTa large performance on each of the NanoBEIR datasets.
Figure 18: T5 base performance on each of the NanoBEIR datasets.
Figure 19: BERT base performance on each of the MTEB datasets.
Figure 20: BERT large performance on each of the MTEB datasets.
Figure 21: RoBERTa base performance on each of the MTEB datasets.
Figure 22: RoBERTa large performance on each of the MTEB datasets.
Figure 23: T5 base performance on each of the MTEB datasets.

Original abstract

Matryoshka Representation Learning (MRL) is a widely adopted approach for training text encoders so they provide useful text representations at various sizes, available by simply truncating the resulting vectors at sizes pre-determined at training time. Recent works have shown that randomly truncating text embeddings has minimal impact in downstream performance unless vectors are reduced in size by at least 70%, suggesting that embeddings are already robust to truncation without the use of MRL. However, no prior work has compared random truncation to MRL, so it is unclear how the two methods compare as effective embedding reduction methods. In this paper, we study this by applying the same truncation used by MRL to models trained with and without MRL. Our results across several models and downstream tasks show that, unless heavily truncating embeddings (i.e. reducing their size by at least 80%), truncated embeddings of non-MRL models are competitive with, and often outperform models trained with MRL. This suggests that truncation robustness may not necessarily come from MRL, and that the choice of spending the additional training cost of MRL depends on whether heavy truncation is desired. We make our code available for reproduction.
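
For orientation, the MRL objective is commonly written as a weighted sum of a base loss evaluated on each prefix of the embedding, whereas standard training evaluates the loss on the full vector only. The notation below is an assumption introduced here, not taken from this page: $f_\theta$ is the encoder, $d$ the full embedding size, $\mathcal{M}$ the set of prefix sizes fixed at training time, $c_m$ per-size weights, and $\ell$ the base loss (the figure captions suggest a contrastive loss, but the exact form is not given here).

```latex
\mathcal{L}_{\mathrm{std}}(\theta) = \ell\big(f_\theta(x)_{1:d}\big),
\qquad
\mathcal{L}_{\mathrm{MRL}}(\theta) = \sum_{m \in \mathcal{M}} c_m \, \ell\big(f_\theta(x)_{1:m}\big)
```

Read this way, the comparison in the abstract asks whether a model trained with $\mathcal{L}_{\mathrm{std}}$ alone still performs well when its output is cut to $f_\theta(x)_{1:m}$ for the same sizes $m \in \mathcal{M}$.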

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines whether Matryoshka Representation Learning (MRL) is required to produce truncation-robust text embeddings. By applying the identical truncation schedule used during MRL training to both MRL-trained and standard (non-MRL) text encoders across multiple models and downstream tasks, the authors report that non-MRL embeddings remain competitive with—and frequently outperform—MRL embeddings unless the embedding dimension is reduced by at least 80%. The central conclusion is that the extra training cost of MRL is justified only in heavy-truncation regimes.

Significance. If the empirical comparison holds after addressing the noted experimental gaps, the result would have clear practical value for embedding-model training pipelines: it indicates that standard contrastive or masked-language-model training already yields sufficient robustness for moderate truncation, thereby questioning the routine adoption of MRL when only modest size reduction is needed. The work also supplies a useful baseline for future studies on embedding compression and dimensionality.

major comments (2)
  1. [Section 3 (Experimental Setup) and Section 4 (Results)] The experimental design applies the MRL-derived truncation points (prefix cuts at the sizes chosen during MRL training) directly to non-MRL embeddings without an ablation that tests alternative dimension-selection strategies (e.g., variance-ranked or random selection) at the same target sizes. Because MRL explicitly optimizes nested representations for precisely those cutoffs, the observed competitiveness of non-MRL models could be an artifact of the schedule rather than intrinsic robustness; this directly affects the claim that MRL training itself is not required.
  2. [Section 4 and associated tables/figures] The abstract and results sections state that non-MRL truncated embeddings “often outperform” MRL models, yet the manuscript provides neither error bars nor statistical significance tests for the pairwise comparisons. Without these, it is difficult to assess whether the reported outperformance is reliable or within the noise of the evaluation.
minor comments (2)
  1. [Section 3.2] The description of the exact truncation percentages and the corresponding absolute dimensions (e.g., 768 → 128) should be tabulated for each model so readers can reproduce the reduction ratios precisely.
  2. [Figures 2–4] Figure captions would benefit from explicitly labeling which curves correspond to MRL versus non-MRL models and whether the plotted points reflect mean performance across seeds.
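
The ablation requested in major comment 1 could be sketched along the following lines: pick the same number of dimensions by prefix, at random, or by per-dimension variance, then evaluate each selection on the same tasks. This is a hypothetical sketch, not the authors' code; the function names, the toy data, and the use of population variance as the ranking criterion are all assumptions made for illustration.

```python
import random
import statistics

def prefix_dims(n_dims, size):
    """Indices of the first `size` dimensions (the MRL-style prefix cut)."""
    return list(range(size))

def random_dims(n_dims, size, seed=0):
    """Indices of `size` dimensions sampled uniformly without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_dims), size))

def variance_ranked_dims(embeddings, size):
    """Indices of the `size` dimensions with the highest variance across texts."""
    n_dims = len(embeddings[0])
    variances = [
        statistics.pvariance([e[j] for e in embeddings]) for j in range(n_dims)
    ]
    top = sorted(range(n_dims), key=lambda j: variances[j], reverse=True)[:size]
    return sorted(top)

def select(embeddings, dims):
    """Project every embedding onto the chosen dimensions."""
    return [[e[j] for j in dims] for e in embeddings]

# Toy data: 4 embeddings with 6 dimensions; the last two dimensions vary most.
embeddings = [
    [0.1, 0.0, 0.2, 0.1, -2.0, 3.0],
    [0.1, 0.1, 0.2, 0.0, 2.0, -3.0],
    [0.0, 0.0, 0.2, 0.1, -1.0, 1.0],
    [0.1, 0.1, 0.2, 0.0, 1.0, -1.0],
]
size = 2
strategies = {
    "prefix": prefix_dims(6, size),
    "random": random_dims(6, size),
    "variance": variance_ranked_dims(embeddings, size),
}
for name, dims in strategies.items():
    print(name, dims, select(embeddings, dims)[0])
```

Holding the target size fixed across strategies is the point of the design: any performance gap then reflects which dimensions were kept, not how many.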

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major concerns point by point below and have made revisions to the manuscript to improve clarity and rigor where appropriate.

Point-by-point responses
  1. Referee: [Section 3 (Experimental Setup) and Section 4 (Results)] The experimental design applies the MRL-derived truncation points (prefix cuts at the sizes chosen during MRL training) directly to non-MRL embeddings without an ablation that tests alternative dimension-selection strategies (e.g., variance-ranked or random selection) at the same target sizes. Because MRL explicitly optimizes nested representations for precisely those cutoffs, the observed competitiveness of non-MRL models could be an artifact of the schedule rather than intrinsic robustness; this directly affects the claim that MRL training itself is not required.

    Authors: We chose to apply the MRL truncation schedule to non-MRL embeddings precisely to perform a controlled comparison at the dimensions for which MRL provides optimized representations. This setup directly tests whether the additional MRL training objective is necessary to achieve good performance at those specific sizes. If non-MRL embeddings perform competitively even when truncated at MRL's chosen cutoffs, it suggests that the robustness is largely intrinsic to standard training rather than dependent on MRL's nested optimization. Alternative selection strategies such as variance-based ranking would address a different question—namely, how to best truncate a fixed non-MRL embedding—rather than whether MRL training is required. We have added a clarifying paragraph in Section 3 of the revised manuscript to better articulate this experimental rationale and its relation to our central claim. revision: partial

  2. Referee: [Section 4 and associated tables/figures] The abstract and results sections state that non-MRL truncated embeddings “often outperform” MRL models, yet the manuscript provides neither error bars nor statistical significance tests for the pairwise comparisons. Without these, it is difficult to assess whether the reported outperformance is reliable or within the noise of the evaluation.

    Authors: We agree that the lack of error bars and statistical tests limits the strength of the outperformance claims. In the revised manuscript, we have added error bars representing standard deviation across multiple random seeds or evaluation runs to all relevant figures and tables. Additionally, we have included results of statistical significance tests (e.g., paired t-tests) for the key comparisons between MRL and non-MRL at each truncation level. These updates confirm that the reported advantages of non-MRL embeddings in moderate truncation regimes are statistically significant in the majority of cases. revision: yes
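
A paired test of the kind mentioned in this response can be sketched with the standard library alone. The example below computes the paired t statistic over per-task score differences at one truncation level. The function name `paired_t_statistic` is hypothetical, and the scores are placeholder numbers invented for illustration, not results from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples scores_a vs scores_b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder per-task scores at a single truncation level (NOT paper results).
non_mrl = [0.52, 0.48, 0.61, 0.55, 0.50]
mrl = [0.50, 0.47, 0.60, 0.52, 0.50]

t, dof = paired_t_statistic(non_mrl, mrl)
print(f"t = {t:.3f}, degrees of freedom = {dof}")
```

Turning the statistic into a p-value requires the t distribution; `scipy.stats.ttest_rel` does both steps if SciPy is available. Because one test is run per truncation level, a correction for multiple comparisons would likely be needed as well.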

Circularity Check

0 steps flagged

No circularity: empirical comparison without derivation or self-referential structure

Full rationale

The paper advances an empirical claim based on direct head-to-head experiments that apply the same truncation schedule to both MRL-trained and non-MRL models across multiple encoders and downstream tasks. No equations, fitted parameters renamed as predictions, or self-definitional steps appear in the abstract or described method. The central result, that non-MRL truncated embeddings remain competitive except under heavy truncation (a size reduction of at least 80%), is presented as an observation from those comparisons rather than a quantity derived from prior outputs of the same model. Any self-citations to the original MRL work are external and non-load-bearing; the present study does not invoke uniqueness theorems or ansatzes from the authors' own prior publications to justify its conclusions. The argument is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on standard machine-learning experimental assumptions rather than new free parameters, axioms, or invented entities.

axioms (1)
  • domain assumption: Standard assumptions in machine learning about fair model comparison and downstream task evaluation.
    The paper relies on typical practices for training embeddings and measuring performance on downstream tasks.

