pith. machine review for the scientific record.

arxiv: 2604.27398 · v1 · submitted 2026-04-30 · 💻 cs.CL

Recognition: unknown

Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

Hiroto Kurita, Kentaro Inui, Masaaki Imaizumi, Sho Yokoi, Tomomasa Hara

Pith reviewed 2026-05-07 08:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords mean pooling · text embeddings · second-order statistics · contrastive fine-tuning · token embeddings · information collapse · embedding robustness · downstream performance

The pith

Mean pooling collapses the second-order statistics of token embeddings, but modern encoders resist this collapse because their token embeddings concentrate within each text; contrastive fine-tuning increases both the resistance and downstream task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a metric to quantify how mean pooling can erase the spatial structure among token embeddings, potentially mapping different texts to similar averages. It then measures this effect across real models and finds limited collapse overall. Contrastively fine-tuned encoders show less collapse than their pretrained bases because their token embeddings cluster more tightly within each input. The amount of resistance also tracks downstream task accuracy, suggesting that mean pooling succeeds in practice because encoders are already trained to preserve what averaging would otherwise lose.

Core claim

Mean pooling can map distinct token embedding distributions to similar text embeddings by discarding second-order statistics, yet actual encoders prove robust to this collapse; the robustness stems from concentration of token embeddings inside each text, is stronger after contrastive fine-tuning, and correlates with downstream task performance.

What carries the argument

A metric that quantifies second-order collapse by measuring how much the spatial structure among token embeddings is lost under mean pooling.
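To make the measurement concrete, here is a minimal NumPy sketch in the spirit of the description above. The names `d_mu`, `d_sigma`, and `socm_like` are ours, and the combination rule is an assumption: this page does not reproduce the paper's actual SOCM formula, only the idea that collapse is worst when two texts' token-embedding means nearly coincide while their covariances differ.

```python
import numpy as np

def d_mu(X, Y):
    """First-order discrepancy: distance between the token-embedding means.
    X and Y are (n_tokens, dim) matrices for two texts."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def d_sigma(X, Y):
    """Second-order discrepancy: Frobenius distance between the texts'
    token covariance matrices."""
    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    return np.linalg.norm(cov_x - cov_y, ord="fro")

def socm_like(X, Y, eps=1e-8):
    """Hypothetical collapse score in [0, 1]: high when the means coincide
    (mean pooling sees the texts as near-identical) while the covariances
    differ (spatial structure that pooling threw away)."""
    ds = d_sigma(X, Y)
    return ds / (d_mu(X, Y) + ds + eps)
```

A pair scoring near 1 is the failure case the paper worries about: mean-pooled embeddings that look the same for texts whose token clouds are shaped differently.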

If this is right

  • Contrastive fine-tuning reduces the second-order collapse that mean pooling would otherwise cause.
  • The concentration of token embeddings within each text is what prevents the collapse (see the sketch after this list).
  • Encoders whose embeddings resist this collapse achieve stronger results on downstream tasks.
  • Mean pooling remains effective because training already aligns token distributions to survive averaging.
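Concentration here is directly measurable. The quantity the paper's layer-wise figures track is the average pairwise cosine similarity among a text's token embeddings, (1/n²) Σ_{j,k} cos(x_j, x_k); the sketch below (the function name is ours) computes it for one token matrix.

```python
import numpy as np

def concentration(X):
    """Average pairwise cosine similarity (1/n^2) * sum_{j,k} cos(x_j, x_k)
    over the n token embeddings of one text. Values near 1 mean the tokens
    all point in nearly the same direction."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    return (Xn @ Xn.T).mean()                          # mean of all n^2 cosines
```

Highly concentrated tokens barely cancel when averaged, so the mean retains most of their information; widely spread tokens cancel, leaving the mean underdetermined, which is exactly when second-order collapse bites.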

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same concentration principle could be tested as a training objective for pooling in vision or speech models.
  • The metric might serve as a diagnostic during pretraining to encourage better preservation of distributional information.
  • If collapse resistance is the key, then other simple pooling methods, such as max pooling or attention-weighted pooling, could be compared directly using the same measure, as sketched below.
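A minimal sketch of that comparison, reusing the covariance distance from the earlier sketch. `pooled_collapse` and both pooling helpers are hypothetical names, and extending the collapse measure to non-mean pooling this way is our assumption, not the paper's definition.

```python
import numpy as np

def d_sigma(X, Y):
    """Frobenius distance between token covariances (as in the earlier sketch)."""
    return np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False),
                          ord="fro")

def pooled_collapse(X, Y, pool, eps=1e-8):
    """Collapse score for an arbitrary pooling function: high when the pooled
    vectors nearly coincide but the token covariances differ."""
    d_pool = np.linalg.norm(pool(X) - pool(Y))
    ds = d_sigma(X, Y)
    return ds / (d_pool + ds + eps)

def mean_pool(X):
    return X.mean(axis=0)  # the standard choice studied in the paper

def max_pool(X):
    return X.max(axis=0)   # per-dimension max, a common alternative

# Usage: compare pooled_collapse(X, Y, mean_pool) against
# pooled_collapse(X, Y, max_pool) over the same text pairs.
```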

Load-bearing premise

The chosen metric captures information loss that actually affects downstream task performance, rather than being an artifact of the particular similarity measure or embedding space used.

What would settle it

An experiment showing that encoders with low collapse scores perform no better on tasks than those with high collapse scores, or that forcing higher collapse does not degrade task results, would disprove the link.

Figures

Figures reproduced from arXiv: 2604.27398 by Hiroto Kurita, Kentaro Inui, Masaaki Imaizumi, Sho Yokoi, Tomomasa Hara.

Figure 1
Figure 1: Overview of this work. Top: Mean pooling can map distinct token embedding distributions to similar text embeddings. This is because mean pooling summarizes distributions using only their first-order statistics, collapsing higher-order statistics. Bottom: We empirically find that modern fine-tuned text encoders are robust to such a collapse (§ 5). Each panel visualizes token and text embeddings for two … view at source ↗
Figure 2
Figure 2: When mean pooling collapse arises (red) and … view at source ↗
Figure 3
Figure 3: SOCM values for each combination of (dµ, dΣ). The figure confirms that SOCM satisfies all properties (a)–(e). The four corners correspond to the scenarios in … view at source ↗
Figure 4
Figure 4: Layer-wise analysis of token embedding concentration for BERT and GTE view at source ↗
Figure 5
Figure 5: Scatter plot of average SOCM and MTEB (eng, v2) score. Each point represents a model. Marker types indicate the backbones. Models with a BERT backbone are annotated with their names. (Inset correlations: ρ = −0.678 for SOCM; ρ = −0.622 for S(X)/∥µ(X)∥²₂.) view at source ↗
Figure 6
Figure 6: Layer-wise analysis of token embedding concentration for BERT and E5 view at source ↗
Figure 7
Figure 7: Layer-wise analysis of token embedding concentration for BERT and Unsupervised SimCSE. (a) Avg. … view at source ↗
Figure 8
Figure 8: Layer-wise analysis of token embedding concentration for MiniLM and GTE view at source ↗
Figure 9
Figure 9: Layer-wise analysis of token embedding concentration for MiniLM and E5 view at source ↗
Figure 10
Figure 10: Layer-wise analysis of token embedding concentration for MiniLM and all-MiniLM-L12-v2. (a) Avg. … view at source ↗
Figure 11
Figure 11: Layer-wise average of (1/n²) Σ_{j,k} cos(x_j, x_k) for BERT and GTEbase on the Wikipedia dataset. view at source ↗
Figure 12
Figure 12: Layer-wise average of (1/n²) Σ_{j,k} cos(x_j, x_k) for BERT and E5base on the Wikipedia dataset. view at source ↗
Figure 15
Figure 15: Layer-wise average of (1/n²) Σ_{j,k} cos(x_j, x_k) for MiniLM and E5small on the Wikipedia dataset. view at source ↗
Figure 16
Figure 16: Layer-wise average of (1/n²) Σ_{j,k} cos(x_j, x_k) for MiniLM and all-MiniLM-L12-v2 on the Wikipedia dataset. view at source ↗
Figure 17
Figure 17: Scatter plots of dµ and dΣ for the models examined in § 5 on Wikipedia. Each point represents a text pair, colored by SOCM. The top and right marginal histograms show the distributions of dµ and dΣ. view at source ↗
read the original abstract

For constructing text embeddings, mean pooling, which averages token embeddings, is the standard approach. This paper examines whether mean pooling actually works well in real models. First, we note that mean pooling can collapse information beyond the first-order statistics of the token embeddings, such as second-order statistics that capture their spatial structure, potentially mapping distinct token embedding distributions to similar text embeddings. Motivated by this concern, we propose a simple metric to quantify such a collapse induced by mean pooling. Then, using this metric, we empirically measure how often this collapse occurs in actual models and texts, and find that modern text encoders are robust to this collapse. In particular, contrastive fine-tuned text encoders tend to be less prone to the collapse than their pretrained backbone models. We also find that the robustness of these text encoders lies in the concentration of token embeddings within each text. In addition, we find that robustness to the collapse, as quantified by our proposed metric, correlates with downstream task performance. Overall, our findings offer a new perspective on why modern text encoders remain effective despite relying on seemingly coarse mean pooling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that mean pooling of token embeddings can induce second-order collapse (distinct token distributions mapping to similar means), proposes a metric based on second moments or distribution distances to quantify this, and empirically finds that modern text encoders—particularly contrastive fine-tuned ones—are robust to it due to high token embedding concentration within texts. It further reports that lower collapse scores correlate with better downstream task performance, offering an explanation for why mean pooling remains effective.

Significance. If the metric successfully isolates second-order effects that are causally linked to performance (independent of first-order concentration and training objectives), the work provides a useful diagnostic for embedding quality and a partial explanation for mean pooling's success. The empirical observation that contrastive models show both lower collapse and higher concentration is consistent with known geometry optimization in contrastive training; the correlation with tasks is potentially actionable for model selection but requires disambiguation from confounding factors to be load-bearing.

major comments (2)
  1. [Empirical results / correlation analysis] The central claim that the proposed metric quantifies second-order collapse whose absence explains mean pooling's utility rests on the correlation with downstream performance. However, contrastive fine-tuning directly optimizes embedding concentration and geometry; any metric sensitive to these properties will correlate with performance by construction. No ablation or control is described that isolates the second-order component (e.g., by holding first-order statistics fixed or comparing to non-contrastive baselines with matched concentration) to show that the metric captures task-relevant information loss beyond known geometric effects.
  2. [Analysis of token concentration] The robustness finding attributes lower collapse in contrastive models to 'concentration of token embeddings within each text.' This risks circularity: concentration is both the mechanism claimed to reduce collapse and a direct outcome of the contrastive objective. A concrete test (e.g., measuring collapse after explicitly controlling or randomizing concentration while preserving means) is needed to establish that second-order preservation, rather than concentration per se, drives the performance correlation.
minor comments (2)
  1. [Abstract / Methods] The abstract and high-level description omit equations for the proposed metric and details of the similarity measure or distribution distance used; these should be stated explicitly early in the methods section for reproducibility.
  2. [Experimental evaluation] Downstream task correlations should report effect sizes, confidence intervals, and controls for model size or training data volume, as these are common confounders when comparing pretrained vs. fine-tuned encoders.
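For the correlation analysis, a bootstrap over models is one standard way to obtain the requested interval. This sketch assumes per-model NumPy arrays of collapse scores and task scores; the function name and variables are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman_ci(collapse_scores, task_scores, n_boot=10_000, seed=0):
    """95% bootstrap CI for the rank correlation between per-model collapse
    scores and downstream task scores, resampling models with replacement."""
    rng = np.random.default_rng(seed)
    n = len(collapse_scores)
    rhos = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        rhos[b] = spearmanr(collapse_scores[idx], task_scores[idx]).correlation
    return np.percentile(rhos, [2.5, 97.5])
```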

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger isolation of second-order effects in our empirical analysis. We agree that additional controls are required to substantiate the claims and will revise the manuscript to include them. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: The central claim that the proposed metric quantifies second-order collapse whose absence explains mean pooling's utility rests on the correlation with downstream performance. However, contrastive fine-tuning directly optimizes embedding concentration and geometry; any metric sensitive to these properties will correlate with performance by construction. No ablation or control is described that isolates the second-order component (e.g., by holding first-order statistics fixed or comparing to non-contrastive baselines with matched concentration) to show that the metric captures task-relevant information loss beyond known geometric effects.

    Authors: We acknowledge that the absence of explicit ablations leaves the isolation of second-order effects incomplete. Our metric is constructed from second-moment discrepancies and distribution distances that are mathematically orthogonal to the mean, yet we agree that empirical disambiguation from concentration is necessary. In the revision we will add two controls: (1) synthetic embeddings with fixed means and variances but varied higher-order structure to measure metric sensitivity, and (2) comparison against non-contrastive models whose concentration is matched via post-hoc scaling. These additions will test whether the metric retains predictive power for downstream performance after first-order statistics are held constant. revision: yes

  2. Referee: The robustness finding attributes lower collapse in contrastive models to 'concentration of token embeddings within each text.' This risks circularity: concentration is both the mechanism claimed to reduce collapse and a direct outcome of the contrastive objective. A concrete test (e.g., measuring collapse after explicitly controlling or randomizing concentration while preserving means) is needed to establish that second-order preservation, rather than concentration per se, drives the performance correlation.

    Authors: We recognize the circularity concern. While our observational results link concentration to reduced collapse, we did not perform interventional tests. In the revision we will introduce a controlled experiment that preserves per-text means while systematically varying intra-text concentration (via variance scaling and controlled randomization of token positions). We will then recompute the collapse metric and re-evaluate its correlation with downstream tasks on these modified embeddings. This will clarify whether second-order preservation, enabled by but separable from concentration, is the operative factor. revision: yes
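Both proposed controls are cheap to prototype. A minimal sketch, with hypothetical names (`match_first_two_moments`, `rescale_concentration`), assuming "higher-order structure" means moments beyond mean and variance and "variance scaling" means rescaling each token's deviation from its text mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 64  # tokens per text, embedding dimension

def match_first_two_moments(X):
    """Force exactly zero mean and unit variance in every dimension."""
    X = X - X.mean(axis=0)
    return X / X.std(axis=0)

# Control (1): identical per-dimension first- and second-order statistics,
# different higher-order structure (Gaussian vs. heavy-tailed Student-t).
gaussian_cloud = match_first_two_moments(rng.standard_normal((n, d)))
heavy_cloud = match_first_two_moments(rng.standard_t(df=3, size=(n, d)))

def rescale_concentration(X, alpha):
    """Control (2): scale deviations from the text mean by alpha. The mean
    pooling output is preserved exactly; alpha < 1 concentrates the tokens,
    alpha > 1 spreads them out."""
    mu = X.mean(axis=0)
    return mu + alpha * (X - mu)

X_tight = rescale_concentration(gaussian_cloud, alpha=0.5)
assert np.allclose(gaussian_cloud.mean(axis=0), X_tight.mean(axis=0))
```

Because the intervention leaves every text's mean untouched, any change in the collapse metric or in downstream scores after applying it can be attributed to structure beyond first-order statistics.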

Circularity Check

0 steps flagged

No significant circularity in empirical metric proposal or observed correlations

full rationale

The paper proposes a new metric for second-order collapse induced by mean pooling, then reports direct empirical measurements of this metric across models and texts, along with observed correlations to downstream task performance. No derivation, prediction, or central claim reduces by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. All steps are falsifiable computations against external benchmarks and task data, making the analysis self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the metric is described only at a conceptual level without formulation details.

pith-pipeline@v0.9.0 · 5505 in / 1041 out tokens · 63837 ms · 2026-05-07T08:50:19.044350+00:00 · methodology

discussion (0)

