pith. machine review for the scientific record.

arxiv: 2602.09229 · v3 · submitted 2026-02-09 · 💻 cs.LG · cs.IR

When Does Embedding Magnitude Matter? A Cross-Task Functional-Symmetry Framework

Pith reviewed 2026-05-16 05:03 UTC · model grok-4.3

classification 💻 cs.LG cs.IR
keywords: embedding normalization, functional symmetry, cosine similarity, retrieval, query document, magnitude, unilateral normalization, fisher information

The pith

Unilateral normalization of query or document embeddings outperforms cosine and dot product when tasks treat inputs asymmetrically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a 2x2 framework that independently decides whether to normalize the query side, the document side, both, or neither. This produces two previously unexamined unilateral variants that beat standard cosine similarity and dot product on retrieval benchmarks both in-domain and out-of-domain. The performance edge arises because document magnitude directly scales inference scores while query magnitude shapes training gradients, with the Fisher Information Matrix condition number indicating the better side to normalize. Tasks are sorted by functional symmetry, defined as whether the scoring procedure treats query and candidate as interchangeable; symmetric tasks favor full normalization and asymmetric tasks favor preserving magnitude on at least one side. The same symmetry rule correctly predicts the best normalization choice across five additional task families including similarity, vision-language, knowledge graphs, few-shot classification, and recommendation.

Core claim

By separating normalization control on each side, the unilateral variants achieve higher accuracy than either cosine (both sides normalized) or dot product (neither side normalized). Document magnitude scales the raw scores at inference time while query magnitude modulates the gradients seen during training. The condition number of the Fisher Information Matrix reliably signals which side should be normalized. When tasks are classified by functional symmetry—whether the aggregate scoring procedure would treat a query and candidate as interchangeable—the coarse rule holds: cosine for symmetric tasks and magnitude-preserving choices for asymmetric ones. This pattern appears consistently on MS

What carries the argument

The 2x2 normalization grid that independently toggles query-side and document-side normalization, with functional symmetry serving as the task classifier that selects the appropriate cell.
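
The four cells of the grid can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names `l2_normalize`, `score`, and `GRID` are hypothetical, and the cell labels follow the abstract's naming, with QNorm normalizing only the query and DNorm only the document.

```python
import math

def l2_normalize(v):
    """Return v scaled to unit L2 norm (v must be non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def score(q, d, norm_query, norm_doc):
    """Inner product after optionally normalizing each side independently."""
    if norm_query:
        q = l2_normalize(q)
    if norm_doc:
        d = l2_normalize(d)
    return sum(a * b for a, b in zip(q, d))

# Hypothetical labels for the four cells: (normalize query?, normalize document?)
GRID = {
    "cosine": (True, True),
    "qnorm": (True, False),
    "dnorm": (False, True),
    "dot": (False, False),
}

q = [3.0, 4.0]  # ||q|| = 5
d = [6.0, 8.0]  # ||d|| = 10, same direction as q
scores = {name: score(q, d, nq, nd) for name, (nq, nd) in GRID.items()}
```

For these two parallel vectors the cosine cell gives 1, the QNorm cell gives the document norm (10), the DNorm cell gives the query norm (5), and the dot cell gives their product (50), up to floating-point rounding.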

If this is right

  • Unilateral normalization delivers up to 72 percent relative gains out-of-domain on retrieval and 24 percent on downstream RAG.
  • Document magnitude scales inference scores while query magnitude modulates training gradients.
  • The Fisher Information Matrix condition number predicts the preferred normalization side.
  • The symmetry-based rule correctly selects normalization for semantic textual similarity, CLIP, knowledge graph completion, few-shot classification, and recommender systems.
  • On recommendation the unilateral variants beat cosine, and on few-shot classification DNorm beats both cosine and the Euclidean default.
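
The two magnitude effects in the list above can be illustrated with a toy example. The sketch below assumes a softmax contrastive loss with the temperature omitted, which is an assumption about the training setup rather than something this review specifies; all function names and vectors are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0]
d = [2.0, 1.0]
d_big = [3 * x for x in d]  # same direction, 3x the magnitude

# Inference: when document magnitude is preserved (QNorm) the score scales
# with ||d||; when the document is normalized (DNorm) it does not.
qnorm_ratio = dot(unit(q), d_big) / dot(unit(q), d)
dnorm_ratio = dot(q, unit(d_big)) / dot(q, unit(d))

# Training: when query magnitude is preserved (DNorm), a larger ||q|| scales
# every logit and sharpens the softmax over candidates.
docs = [unit([2.0, 1.0]), unit([1.0, 2.0]), unit([-1.0, 1.0])]
p_small = softmax([dot(q, c) for c in docs])
p_large = softmax([dot([4 * x for x in q], c) for c in docs])
```

Here `qnorm_ratio` is 3 and `dnorm_ratio` is 1 up to rounding, and `max(p_large)` exceeds `max(p_small)`. Because the gradient of softmax cross-entropy with respect to the logits is the predicted distribution minus the one-hot target, a sharper distribution changes the gradient each candidate receives, which is one way to read "query magnitude modulates training gradients".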

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could add a quick symmetry check to decide normalization before training rather than tuning after the fact.
  • Models that dynamically adjust normalization per batch according to input symmetry might further improve results on mixed task collections.
  • The same magnitude logic may explain performance gaps in other asymmetric settings such as retrieval-augmented generation pipelines.
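
The "quick symmetry check" suggested in the first bullet could look roughly like the sketch below. It only encodes the coarse rule stated in the paper (cosine for symmetric tasks, a magnitude-preserving cell for asymmetric ones) and falls back to validation scores to pick among the asymmetric cells. The function names are hypothetical and the example metric values are illustrative numbers, not results from the paper.

```python
def candidate_similarities(task_is_symmetric):
    """Narrow the 2x2 grid using the coarse symmetry rule."""
    if task_is_symmetric:
        return ("cosine",)
    return ("dot", "qnorm", "dnorm")

def choose_similarity(task_is_symmetric, validation_scores=None):
    """Return one cell name; asymmetric tasks are resolved by validation score."""
    candidates = candidate_similarities(task_is_symmetric)
    if len(candidates) == 1:
        return candidates[0]
    if not validation_scores:
        raise ValueError("asymmetric tasks need validation scores to pick a cell")
    # Higher is better; cells without a score are never selected.
    return max(candidates, key=lambda name: validation_scores.get(name, float("-inf")))

# Illustrative validation metrics only (not taken from the paper).
example_scores = {"dot": 0.31, "qnorm": 0.35, "dnorm": 0.38}
symmetric_choice = choose_similarity(True)
asymmetric_choice = choose_similarity(False, example_scores)
```

With these made-up numbers the symmetric task resolves to `"cosine"` and the asymmetric task to `"dnorm"`. The rule deliberately stops short of choosing among Dot, QNorm, and DNorm on its own, since the paper reports that the best magnitude-preserving cell varies by model and task.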

Load-bearing premise

Functional symmetry is the right and sufficient property for deciding which side to normalize, and observed gains are produced by the magnitude mechanism rather than by incidental implementation or data differences.

What would settle it

Either of two results would count against the claim: a controlled test on a clearly symmetric task in which the unilateral variants match or beat cosine, or a measurement showing that changing only embedding magnitudes, with the normalization choice held fixed, leaves inference scores and training gradients unchanged.

Figures

Figures reproduced from arXiv: 2602.09229 by Taro Watanabe, Xincan Feng.

Figure 1
Query-document normalization framework. The dashed circle represents the unit sphere. Normalized vectors (v̂) lie on the sphere; unnormalized vectors extend beyond. Rows: query magnitude preserved or discarded. Columns: document magnitude preserved or discarded. Geometric interpretation: dot product permits representations to occupy the full R^n, restoring magnitude as a learnable degree of freedom. …
Figure 2
Training curves across different models and data scales. Val NDCG@10 comparison for Contriever and RetroMAE (by training steps) and Qwen3-Base (by epochs) on MS MARCO 82K and 503K.
Figure 3
Cross-evaluation across all training paradigms. X-axis: training method; bar colors: evaluation similarity. Top row: In-Domain; bottom row: OOD. Key findings: (1) For finetuning and foundation models, DNorm improves angular representations (blue bars highest for DN training); (2) Bilateral Dot collapses without retrieval pretraining (Qwen, random init); (3) Random init reverses magnitude benefits, where Co…
Figure 4
Per-dataset Cohen's d vs. Δ% for Contriever (left) and RetroMAE (right). Cohen's d = (μ_rel − μ_irrel)/σ_pooled measures the standardized difference between relevant (μ_rel) and irrelevant (μ_irrel) document embedding magnitudes, computed from the QueryNorm-trained model. Δ% = (QNorm − Cosine)/Cosine × 100 is the relative performance improvement. Linear regression shows significant correlation: Contriever (r = …
Figure 6
Learned normalization strengths γ_q, γ_d over training. All models are trained on MS MARCO 82K. Contriever and RetroMAE train for 100 epochs; Qwen trains for 40 epochs. Contriever drifts toward Dot (γ < 0.5), while RetroMAE and Qwen drift toward Cosine (γ > 0.5).
Figure 7
CLIP pre-training framework. Gray: symmetric loss yields d ≈ 0. Blue/red: asymmetric loss enables magnitude learning on the non-query side.
Figure 8
Validation NDCG@10 during training for Contriever, RetroMAE, and E5 with different similarity functions (seed=0). All models demonstrate stable convergence across all loss function variants. Key observations: all models first satisfy convergence criteria within 8–20 epochs (approximately 2,500–6,500 steps), indicating rapid stabilization of training dynamics. …
Figure 9
Training curves for random initialization experiments (Table 13c). Val NDCG@10 over training steps for Contriever and RetroMAE with random initialization. Unlike finetuning from pretrained models, Cosine similarity performs best when training from random initialization.
Figure 10
Performance improvement (Δ NDCG@10) when scaling E5 training data from 80K to 500K. DNorm (red) shows the largest gains on out-of-domain benchmarks: Multi-hop (+30.98), BEIR (+14.64), and BRIGHT (+7.32). On in-domain benchmarks, DNorm also improves (+7.49), becoming comparable to Cosine (Table 22b).
Figure 11
ΔCV (query magnitude CV ratio: DNorm/Dot) vs. ΔPerf (DNorm − QNorm) for Contriever and RetroMAE on 39 datasets (3-seed averaged). The two models form clearly separated clusters in ΔCV: Contriever (blue, 0.5–2×) and RetroMAE (red, 4–7×). This separation reflects model-level differences in query magnitude variation, though ΔCV does not correlate with ΔPerf within each cluster.
Figure 12
Concept illustrations for magnitude-based metrics. (a) Cohen's d: measures the standardized difference between relevant (red) and irrelevant (blue) document magnitudes. Small d indicates overlapping distributions; large d (≥ 0.8) indicates clear separation where QNorm can exploit magnitude for ranking. (b) CV: coefficient of variation (σ/μ) of query magnitudes. Low CV indicates uniform magnitudes; high CV…
Figure 15
Per-dataset Cohen's d vs. Δ% for Qwen 82K (left) and Qwen 503K (right). Unlike Contriever and RetroMAE (…
Figure 16
Comparison of learned normalization strengths γ_q, γ_d for Qwen3-Base trained on 82K vs 500K data. With 82K samples, γ drifts toward Cosine (γ ≈ 0.503) within 20 epochs. With 500K samples, γ remains stagnant at initialization (γ ≈ 0.5000) throughout 40 epochs, indicating the learnable parameters fail to update. The 500K drift magnitude (Δγ < 10^−5) is ∼5000× smaller than 82K (Δγ ≈ 0.003).
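
The magnitude diagnostics named in the captions for Figures 4, 11, and 12 are simple to compute once embedding norms are available. The sketch below follows the formulas given in those captions; the pooled standard deviation uses the common sample-variance form, which is an assumption because the captions do not spell it out. Function names and inputs are hypothetical, and each input is a list of embedding norms, not the embeddings themselves.

```python
import math
import statistics

def cohens_d(relevant_norms, irrelevant_norms):
    """(mean_rel - mean_irrel) / pooled standard deviation of document norms."""
    n1, n2 = len(relevant_norms), len(irrelevant_norms)
    v1 = statistics.variance(relevant_norms)    # sample variance (n - 1)
    v2 = statistics.variance(irrelevant_norms)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(relevant_norms) - statistics.mean(irrelevant_norms)) / pooled

def coefficient_of_variation(query_norms):
    """CV = sigma / mu of query norms."""
    return statistics.stdev(query_norms) / statistics.mean(query_norms)

def relative_gain_percent(variant_metric, baseline_metric):
    """Delta% = (variant - baseline) / baseline * 100."""
    return (variant_metric - baseline_metric) / baseline_metric * 100.0

# Toy norms, chosen only so the arithmetic is easy to follow.
d_value = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
cv_value = coefficient_of_variation([2.0, 4.0, 6.0])
gain = relative_gain_percent(12.0, 10.0)
```

For the toy inputs both groups have sample variance 4, so the pooled standard deviation is 2 and `d_value` is 0.5; `cv_value` is 2/4 = 0.5 and `gain` is 20.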
Original abstract

Cosine similarity normalizes both sides; dot product normalizes neither. We propose a 2x2 framework that independently controls query-side and document-side normalization, exposing two intermediate variants (QNorm, DNorm) that have not been previously studied. On retrieval with four encoders, evaluated in-domain on MS MARCO and out-of-domain on BEIR, BRIGHT, and multi-hop QA, the unilateral variants outperform both cosine and dot product, with relative gains of up to +72% out-of-domain and +24% on downstream RAG. Cross-evaluation reveals the mechanism: document magnitude scales inference scores while query magnitude modulates training gradients, and the Fisher Information Matrix condition number predicts which side to normalize. We then classify tasks by functional symmetry, defined as whether the aggregate scoring procedure treats Q and C as interchangeable, and test whether the mechanism extends beyond retrieval. On five additional task families (semantic textual similarity, CLIP, knowledge graph completion, few-shot classification, recommender systems), the coarse prediction (cosine for symmetric, magnitude-preserving for asymmetric) holds in every case examined; the unilateral variants beat Cosine on recommendation, and on few-shot classification DNorm beats both Cosine and the standard Euclidean default of Prototypical Networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 2x2 framework independently controlling query-side and document-side normalization in embedding models, exposing previously unstudied unilateral variants (QNorm, DNorm). On retrieval with four encoders, these outperform cosine and dot-product baselines on MS MARCO (in-domain) and BEIR/BRIGHT/multi-hop QA (out-of-domain), with relative gains up to +72% OOD and +24% on downstream RAG. The mechanism is attributed to document magnitude scaling inference scores and query magnitude modulating gradients, with the Fisher Information Matrix condition number as predictor. Tasks are classified by functional symmetry (whether aggregate scoring treats query and candidate as interchangeable), and the coarse prediction (cosine for symmetric tasks, magnitude-preserving for asymmetric) is tested on five additional families (STS, CLIP, KG completion, few-shot classification, recommenders), where unilateral variants beat cosine on recommendation and DNorm beats both cosine and Euclidean defaults on few-shot classification.

Significance. If the causal attribution to the magnitude mechanism holds after isolating confounds, the work would be significant for supplying a task-symmetry classifier that guides normalization choice across retrieval, recommendation, and classification, with the cross-task empirical coverage as a notable strength. The unilateral variants' consistent outperformance on out-of-domain sets would also be of practical value if reproducible.

major comments (2)
  1. [mechanism analysis and retrieval experiments] The central attribution of outperformance to document magnitude scaling inference scores and query magnitude modulating gradients (mechanism analysis) is load-bearing for the functional-symmetry claim, yet no ablation holds total embedding scale or gradient norms fixed across the four variants while varying only normalization side. Without this isolation, the reported +72% OOD gains on BEIR/BRIGHT cannot be unambiguously credited to the proposed mechanism versus SGD trajectory changes induced by per-side rescaling.
  2. [Fisher Information Matrix predictor] The Fisher Information Matrix condition number is presented as a predictor of which side to normalize, but the manuscript must specify whether this matrix is computed on held-out data independent of the training runs used to measure performance gains; computation on the same data would introduce circularity that undermines the predictor's validity.
minor comments (2)
  1. [experimental results] Results sections should report error bars, number of runs, and statistical significance tests for all benchmark gains, including the +72% and +24% figures.
  2. [task classification] The exact operational definition of 'functional symmetry' (whether the aggregate scoring procedure treats Q and C as interchangeable) should be formalized with a mathematical criterion or decision procedure in the task-classification section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the mechanism and strengthen the validity of the predictor. We address each major point below and revise the manuscript to incorporate the requested clarifications and additional controls.

Point-by-point responses
  1. Referee: The central attribution of outperformance to document magnitude scaling inference scores and query magnitude modulating gradients (mechanism analysis) is load-bearing for the functional-symmetry claim, yet no ablation holds total embedding scale or gradient norms fixed across the four variants while varying only normalization side. Without this isolation, the reported +72% OOD gains on BEIR/BRIGHT cannot be unambiguously credited to the proposed mechanism versus SGD trajectory changes induced by per-side rescaling.

    Authors: We agree that an ablation holding total embedding scale and gradient norms fixed would provide stronger isolation of the per-side normalization effects. In the revised manuscript we add this control by post-hoc rescaling all embeddings to unit total magnitude before inference (while preserving the per-side normalization choices during training) and re-running the BEIR/BRIGHT evaluations; the unilateral variants retain their advantage, supporting the original attribution. We also report the resulting gradient-norm statistics to confirm the isolation.
    Revision: yes

  2. Referee: The Fisher Information Matrix condition number is presented as a predictor of which side to normalize, but the manuscript must specify whether this matrix is computed on held-out data independent of the training runs used to measure performance gains; computation on the same data would introduce circularity that undermines the predictor's validity.

    Authors: We thank the referee for highlighting this point. The FIM condition numbers were computed on a held-out validation split that was never used for the training runs whose performance is reported. We have added an explicit statement of this data separation, together with the precise computation procedure, in the revised Section 4.2 to remove any ambiguity.
    Revision: yes
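
For readers unfamiliar with the quantity under discussion, a condition number is the ratio of the largest to the smallest eigenvalue of a matrix. The sketch below is only an illustration under a strong assumption: it uses a diagonal empirical Fisher (the average of squared per-example gradients), for which the eigenvalues are just the diagonal entries. The paper's actual estimator is not reproduced in this review, and every name and number here is hypothetical.

```python
import math

def diagonal_fisher(per_example_grads):
    """Average of squared per-example gradients, one entry per parameter.

    ASSUMPTION: diagonal approximation; the paper's estimator may differ.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def condition_number(diag, eps=1e-12):
    """Largest over smallest diagonal entry; eps guards against division by zero."""
    return max(diag) / max(min(diag), eps)

# Toy gradients for three held-out examples and two parameters.
grads = [[1.0, 0.1], [-1.0, 0.1], [1.0, -0.1]]
fisher_diag = diagonal_fisher(grads)
kappa = condition_number(fisher_diag)
```

With these toy gradients the diagonal is roughly [1.0, 0.01], so `kappa` is about 100. To respect the data separation the authors describe, the gradients fed to such a computation would come from a held-out split, not from the examples used to train the models being compared.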

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper defines a 2x2 normalization framework (query/document sides independently) and reports empirical outperformance of unilateral variants on retrieval and other tasks. The Fisher Information Matrix condition number is presented as an observed predictor of which side to normalize, derived from post-training analysis rather than a fitted parameter renamed as prediction or a self-definitional loop. Functional symmetry is introduced as a new classifier and tested on held-out task families without reducing to the input data by construction. No self-citation load-bearing steps, ansatz smuggling, or uniqueness theorems appear in the abstract or described chain. The central claims rest on cross-task empirical validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the framework implicitly treats normalization choice as a discrete decision per task symmetry class, but no fitted constants or new entities are mentioned.

pith-pipeline@v0.9.0 · 5519 in / 1199 out tokens · 35352 ms · 2026-05-16T05:03:33.672185+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models

    cs.CV 2026-05 unverdicted novelty 5.0

    HEART performs Kent-aware geodesic transformations on hyperspherical text embeddings to enable precise, training-free control in text-to-image diffusion models.
