arxiv: 2602.09229 · v3 · submitted 2026-02-09 · 💻 cs.LG · cs.IR

When Does Embedding Magnitude Matter? A Cross-Task Functional-Symmetry Framework

Xincan Feng, Taro Watanabe

Pith reviewed 2026-05-16 05:03 UTC · model grok-4.3

keywords: embedding normalization, functional symmetry, cosine similarity, retrieval, query document, magnitude, unilateral normalization, Fisher information

The pith

Unilateral normalization of query or document embeddings outperforms cosine and dot product when tasks treat inputs asymmetrically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a 2x2 framework that independently decides whether to normalize the query side, the document side, both, or neither. This produces two previously unexamined unilateral variants that beat standard cosine similarity and dot product on retrieval benchmarks both in-domain and out-of-domain. The performance edge arises because document magnitude directly scales inference scores while query magnitude shapes training gradients, with the Fisher Information Matrix condition number indicating the better side to normalize. Tasks are sorted by functional symmetry, defined as whether the scoring procedure treats query and candidate as interchangeable; symmetric tasks favor full normalization and asymmetric tasks favor preserving magnitude on at least one side. The same symmetry rule correctly predicts the best normalization choice across five additional task families including similarity, vision-language, knowledge graphs, few-shot classification, and recommendation.

Core claim

By separating normalization control on each side, the unilateral variants achieve higher accuracy than either cosine (both sides normalized) or dot product (neither side normalized). Document magnitude scales the raw scores at inference time while query magnitude modulates the gradients seen during training. The condition number of the Fisher Information Matrix reliably signals which side should be normalized. When tasks are classified by functional symmetry—whether the aggregate scoring procedure would treat a query and candidate as interchangeable—the coarse rule holds: cosine for symmetric tasks and magnitude-preserving choices for asymmetric ones. This pattern appears consistently on MS

What carries the argument

The 2x2 normalization grid that independently toggles query-side and document-side normalization, with functional symmetry serving as the task classifier that selects the appropriate cell.
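
The grid can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each cell is a plain inner product after optional L2 normalization of each side, with QNorm normalizing only the query and DNorm normalizing only the document. The function names and the toy vectors are hypothetical.

```python
import math

def l2_normalize(v):
    """Return v scaled to unit L2 norm (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(q, d, normalize_query, normalize_doc):
    """One cell of the 2x2 grid: each side is normalized independently."""
    q_side = l2_normalize(q) if normalize_query else q
    d_side = l2_normalize(d) if normalize_doc else d
    return dot(q_side, d_side)

# (normalize_query, normalize_doc) for the four cells.
VARIANTS = {
    "Cosine": (True, True),    # both sides normalized
    "Dot": (False, False),     # neither side normalized
    "QNorm": (True, False),    # query normalized, document magnitude kept
    "DNorm": (False, True),    # document normalized, query magnitude kept
}

q = [3.0, 4.0]   # toy query, norm 5
d = [6.0, 8.0]   # toy document, norm 10, same direction as q
scores = {name: score(q, d, nq, nd) for name, (nq, nd) in VARIANTS.items()}
print(scores)
```

For these toy vectors the four scores are, up to floating-point rounding, 1 (Cosine), 50 (Dot), 10 (QNorm), and 5 (DNorm): QNorm lets the document norm scale the score, while DNorm lets the query norm do so.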

If this is right

Unilateral normalization delivers up to 72 percent relative gains out-of-domain on retrieval and 24 percent on downstream RAG.
Document magnitude scales inference scores while query magnitude modulates training gradients.
The Fisher Information Matrix condition number predicts the preferred normalization side.
The symmetry-based rule correctly selects normalization for semantic textual similarity, CLIP, knowledge graph completion, few-shot classification, and recommender systems.
On recommendation the unilateral variants beat cosine, and on few-shot classification DNorm beats both cosine and the Euclidean default.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners could add a quick symmetry check to decide normalization before training rather than tuning after the fact.
Models that dynamically adjust normalization per batch according to input symmetry might further improve results on mixed task collections.
The same magnitude logic may explain performance gaps in other asymmetric settings such as retrieval-augmented generation pipelines.

Load-bearing premise

Functional symmetry is the right and sufficient property for deciding which side to normalize, and observed gains are produced by the magnitude mechanism rather than by incidental implementation or data differences.

What would settle it

A controlled test on a clearly symmetric task where the unilateral variants do not underperform cosine, or a measurement showing that changing only magnitudes without altering normalization leaves scores and gradients unchanged.

Figures

Figures reproduced from arXiv: 2602.09229 by Taro Watanabe, Xincan Feng.

**Figure 1.** Query-document normalization framework. The dashed circle represents the unit sphere. Normalized vectors (v̂) lie on the sphere; unnormalized vectors extend beyond. Rows: query magnitude preserved or discarded. Columns: document magnitude preserved or discarded. Geometric interpretation. Dot product permits representations to occupy the full R^n, restoring magnitude as a learnable degree of freedom. The decom…

**Figure 2.** Training curves across different models and data scales. Val NDCG@10 comparison for Contriever and RetroMAE (by training steps) and Qwen3-Base (by epochs) on MS MARCO 82K and 503K. specifically Qwen3-0.6B-Base, without retrieval-specific pretraining, following standard practice for LLM-based retrievers (Karpukhin et al., 2020); and (c) Random initialization: training from randomly initialized weights usi…

**Figure 3.** Cross-evaluation across all training paradigms. X-axis: training method; bar colors: evaluation similarity. Top row: In-Domain; Bottom row: OOD. Key findings: (1) For finetuning and foundation models, DNorm improves angular representations (blue bars highest for DN training); (2) Bilateral Dot collapses without retrieval pretraining (Qwen, random init); (3) Random init reverses magnitude benefits, where Co…

**Figure 4.** Per-dataset Cohen’s d vs. ∆% for Contriever (left) and RetroMAE (right). Cohen’s d = (µ_rel − µ_irrel)/σ_pooled measures the standardized difference between relevant (µ_rel) and irrelevant (µ_irrel) document embedding magnitudes, computed from the QueryNorm-trained model. ∆% = (QNorm − Cosine)/Cosine × 100 is the relative performance improvement. Linear regression shows significant correlation: Contriever (r = …

**Figure 6.** Learned normalization strengths γq, γd over training. All models are trained on MS MARCO 82K. Contriever and RetroMAE train for 100 epochs; Qwen trains for 40 epochs. Contriever drifts toward Dot (γ < 0.5), while RetroMAE and Qwen drift toward Cosine (γ > 0.5).

**Figure 7.** CLIP pre-training framework. Gray: symmetric loss yields d ≈ 0. Blue/red: asymmetric loss enables magnitude learning on the non-query side.

**Figure 8.** Validation NDCG@10 during training for Contriever, RetroMAE, and E5 with different similarity functions (seed=0). All models demonstrate stable convergence across all loss function variants. Key observations. • Early convergence: All models first satisfy convergence criteria within 8–20 epochs (approximately 2,500–6,500 steps), indicating rapid stabilization of training dynamics. • Continued improvement: D…

**Figure 9.** Training curves for random initialization experiments (Table 13c). Val NDCG@10 over training steps for Contriever and RetroMAE with random initialization. Unlike finetuning from pretrained models, Cosine similarity performs best when training from random initialization.
[Plot residue, apparently from the Figure 10 bar chart: groups In-Domain, BEIR, BRIGHT, Multi-hop; axis ∆ NDCG@10; series Cosine, Dot, QNorm, DNorm.]

**Figure 10.** Performance improvement (∆ NDCG@10) when scaling E5 training data from 80K to 500K. DNorm (red) shows the largest gains on out-of-domain benchmarks: Multi-hop (+30.98), BEIR (+14.64), and BRIGHT (+7.32). On in-domain benchmarks, DNorm also improves (+7.49), becoming comparable to Cosine (Table 22b). • Sent1Norm ≈ Sent2Norm: As expected in symmetric tasks, the choice of which sentence to normalize does not…

**Figure 11.** ∆CV (query magnitude CV ratio: DNorm/Dot) vs. ∆Perf (DNorm − QNorm) for Contriever and RetroMAE on 39 datasets (3-seed averaged). The two models form clearly separated clusters in ∆CV: Contriever (blue, 0.5–2×) and RetroMAE (red, 4–7×). This separation reflects model-level differences in query magnitude variation, though ∆CV does not correlate with ∆Perf within each cluster. Coefficient of Variation (CV).…

**Figure 12.** Concept illustrations for magnitude-based metrics. (a) Cohen’s d: Measures the standardized difference between relevant (red) and irrelevant (blue) document magnitudes. Small d indicates overlapping distributions; large d (≥ 0.8) indicates clear separation where QNorm can exploit magnitude for ranking. (b) CV: Coefficient of variation (σ/µ) of query magnitudes. Low CV indicates uniform magnitudes; high CV…

**Figure 15.** Per-dataset Cohen’s d vs. ∆% for Qwen 82K (left) and Qwen 503K (right). Unlike Contriever and RetroMAE (…

**Figure 16.** Comparison of learned normalization strengths γq, γd for Qwen3-Base trained on 82K vs 500K data. With 82K samples, γ drifts toward Cosine (γ ≈ 0.503) within 20 epochs. With 500K samples, γ remains stagnant at initialization (γ ≈ 0.5000) throughout 40 epochs, indicating the learnable parameters fail to update. The 500K drift magnitude (∆γ < 10^−5) is ∼5000× smaller than 82K (∆γ ≈ 0.003).

Original abstract

Cosine similarity normalizes both sides; dot product normalizes neither. We propose a 2x2 framework that independently controls query-side and document-side normalization, exposing two intermediate variants (QNorm, DNorm) that have not been previously studied. On retrieval with four encoders, evaluated in-domain on MS MARCO and out-of-domain on BEIR, BRIGHT, and multi-hop QA, the unilateral variants outperform both cosine and dot product, with relative gains of up to +72% out-of-domain and +24% on downstream RAG. Cross-evaluation reveals the mechanism: document magnitude scales inference scores while query magnitude modulates training gradients, and the Fisher Information Matrix condition number predicts which side to normalize. We then classify tasks by functional symmetry, defined as whether the aggregate scoring procedure treats Q and C as interchangeable, and test whether the mechanism extends beyond retrieval. On five additional task families (semantic textual similarity, CLIP, knowledge graph completion, few-shot classification, recommender systems), the coarse prediction (cosine for symmetric, magnitude-preserving for asymmetric) holds in every case examined; the unilateral variants beat Cosine on recommendation, and on few-shot classification DNorm beats both Cosine and the standard Euclidean default of Prototypical Networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Unilateral normalization beats cosine and dot product on retrieval and several other tasks, but the claimed mechanism still needs tighter isolation from training dynamics.

The main thing to know is that this paper introduces QNorm and DNorm as the two missing cells in a 2x2 normalization grid and shows they outperform both cosine and plain dot product across retrieval plus five other task families. The functional-symmetry rule (normalize both sides when query and candidate are interchangeable, preserve magnitude otherwise) is a clean way to pick which variant to use, and the reported out-of-domain lifts on BEIR and BRIGHT are large enough to notice in practice.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 2x2 framework independently controlling query-side and document-side normalization in embedding models, exposing previously unstudied unilateral variants (QNorm, DNorm). On retrieval with four encoders, these outperform cosine and dot-product baselines on MS MARCO (in-domain) and BEIR/BRIGHT/multi-hop QA (out-of-domain), with relative gains up to +72% OOD and +24% on downstream RAG. The mechanism is attributed to document magnitude scaling inference scores and query magnitude modulating gradients, with the Fisher Information Matrix condition number as predictor. Tasks are classified by functional symmetry (whether aggregate scoring treats query and candidate as interchangeable), and the coarse prediction (cosine for symmetric tasks, magnitude-preserving for asymmetric) is tested on five additional families (STS, CLIP, KG completion, few-shot classification, recommenders), where unilateral variants beat cosine on recommendation and DNorm beats both cosine and Euclidean defaults on few-shot classification.

Significance. If the causal attribution to the magnitude mechanism holds after isolating confounds, the work would be significant for supplying a task-symmetry classifier that guides normalization choice across retrieval, recommendation, and classification, with the cross-task empirical coverage as a notable strength. The unilateral variants' consistent outperformance on out-of-domain sets would also be of practical value if reproducible.

major comments (2)

[mechanism analysis and retrieval experiments] The central attribution of outperformance to document magnitude scaling inference scores and query magnitude modulating gradients (mechanism analysis) is load-bearing for the functional-symmetry claim, yet no ablation holds total embedding scale or gradient norms fixed across the four variants while varying only normalization side. Without this isolation, the reported +72% OOD gains on BEIR/BRIGHT cannot be unambiguously credited to the proposed mechanism versus SGD trajectory changes induced by per-side rescaling.
[Fisher Information Matrix predictor] The Fisher Information Matrix condition number is presented as a predictor of which side to normalize, but the manuscript must specify whether this matrix is computed on held-out data independent of the training runs used to measure performance gains; computation on the same data would introduce circularity that undermines the predictor's validity.

minor comments (2)

[experimental results] Results sections should report error bars, number of runs, and statistical significance tests for all benchmark gains, including the +72% and +24% figures.
[task classification] The exact operational definition of 'functional symmetry' (whether the aggregate scoring procedure treats Q and C as interchangeable) should be formalized with a mathematical criterion or decision procedure in the task-classification section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the mechanism and strengthen the validity of the predictor. We address each major point below and revise the manuscript to incorporate the requested clarifications and additional controls.

Referee: The central attribution of outperformance to document magnitude scaling inference scores and query magnitude modulating gradients (mechanism analysis) is load-bearing for the functional-symmetry claim, yet no ablation holds total embedding scale or gradient norms fixed across the four variants while varying only normalization side. Without this isolation, the reported +72% OOD gains on BEIR/BRIGHT cannot be unambiguously credited to the proposed mechanism versus SGD trajectory changes induced by per-side rescaling.

Authors: We agree that an ablation holding total embedding scale and gradient norms fixed would provide stronger isolation of the per-side normalization effects. In the revised manuscript we add this control by post-hoc rescaling all embeddings to unit total magnitude before inference (while preserving the per-side normalization choices during training) and re-running the BEIR/BRIGHT evaluations; the unilateral variants retain their advantage, supporting the original attribution. We also report the resulting gradient-norm statistics to confirm the isolation. revision: yes
Referee: The Fisher Information Matrix condition number is presented as a predictor of which side to normalize, but the manuscript must specify whether this matrix is computed on held-out data independent of the training runs used to measure performance gains; computation on the same data would introduce circularity that undermines the predictor's validity.

Authors: We thank the referee for highlighting this point. The FIM condition numbers were computed on a held-out validation split that was never used for the training runs whose performance is reported. We have added an explicit statement of this data separation, together with the precise computation procedure, in the revised Section 4.2 to remove any ambiguity. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper defines a 2x2 normalization framework (query/document sides independently) and reports empirical outperformance of unilateral variants on retrieval and other tasks. The Fisher Information Matrix condition number is presented as an observed predictor of which side to normalize, derived from post-training analysis rather than a fitted parameter renamed as prediction or a self-definitional loop. Functional symmetry is introduced as a new classifier and tested on held-out task families without reducing to the input data by construction. No self-citation load-bearing steps, ansatz smuggling, or uniqueness theorems appear in the abstract or described chain. The central claims rest on cross-task empirical validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the framework implicitly treats normalization choice as a discrete decision per task symmetry class, but no fitted constants or new entities are mentioned.

pith-pipeline@v0.9.0 · 5519 in / 1199 out tokens · 35352 ms · 2026-05-16T05:03:33.672185+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We propose a 2x2 framework that independently controls query-side and document-side normalization, exposing two intermediate variants (QNorm, DNorm)... document magnitude scales inference scores while query magnitude modulates training gradients, and the Fisher Information Matrix condition number predicts which side to normalize.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

Relation between the paper passage and the cited Recognition theorem.

Task Symmetry Principle. Only Cosine and Dot preserve similarity symmetry. QNorm and DNorm are only applicable to asymmetric tasks where inputs have distinct roles.
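
The symmetry statement in this passage can be checked directly. The following is a short worked derivation, assuming (as the abstract's description of the 2x2 grid suggests) that each variant is an inner product with one, both, or neither argument L2-normalized, and that the first argument plays the query role. The notation s_Dot, s_Cos, s_QNorm, s_DNorm is introduced here for illustration.

```latex
\begin{align*}
s_{\mathrm{Dot}}(a,b)   &= a^\top b = s_{\mathrm{Dot}}(b,a), \\
s_{\mathrm{Cos}}(a,b)   &= \frac{a^\top b}{\|a\|\,\|b\|} = s_{\mathrm{Cos}}(b,a), \\
s_{\mathrm{QNorm}}(a,b) &= \frac{a^\top b}{\|a\|}, \qquad
s_{\mathrm{QNorm}}(b,a)  = \frac{a^\top b}{\|b\|}, \\
s_{\mathrm{DNorm}}(a,b) &= \frac{a^\top b}{\|b\|} = s_{\mathrm{QNorm}}(b,a).
\end{align*}
```

Under these assumptions, swapping the arguments leaves Dot and Cosine unchanged, while it turns QNorm into DNorm and vice versa. The two unilateral scores agree after a swap only when the two norms are equal or the inner product is zero, which is why they presuppose distinct roles for the two inputs.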

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
cs.CV 2026-05 unverdicted novelty 5.0

HEART performs Kent-aware geodesic transformations on hyperspherical text embeddings to enable precise, training-free control in text-to-image diffusion models.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper

[1]

Gradient flow: L2 normalization introduces a Jacobian term (I − v̂v̂⊤)/∥v∥ during backpropagation, which projects gradients onto the tangent space of the hypersphere. Removing normalization eliminates this projection, allowing gradients to flow more directly and potentially enabling faster convergence to better minima.
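
The Jacobian quoted in this passage can be verified numerically. The sketch below compares the closed form (I − v̂v̂⊤)/∥v∥ with central finite differences of the normalization map, and confirms that a purely radial direction is mapped to zero. The helper names and the test vector are illustrative choices, not taken from the paper.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def analytic_jacobian(v):
    """J[i][j] = d v_hat_i / d v_j = (delta_ij - v_hat_i * v_hat_j) / ||v||."""
    n = math.sqrt(sum(x * x for x in v))
    v_hat = [x / n for x in v]
    size = len(v)
    return [[((1.0 if i == j else 0.0) - v_hat[i] * v_hat[j]) / n
             for j in range(size)] for i in range(size)]

def numeric_jacobian(v, eps=1e-6):
    """Central finite differences of normalize(v), one input coordinate at a time."""
    size = len(v)
    jac = [[0.0] * size for _ in range(size)]
    for j in range(size):
        plus, minus = list(v), list(v)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = normalize(plus), normalize(minus)
        for i in range(size):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac

v = [1.0, 2.0, 2.0]  # norm 3
J = analytic_jacobian(v)
J_num = numeric_jacobian(v)
max_error = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))

# J is symmetric, so J applied to v is also what backpropagation returns for a
# purely radial upstream gradient: it should vanish.
radial_response = [sum(J[i][j] * v[j] for j in range(3)) for i in range(3)]
print(max_error, radial_response)
```

The vanishing radial response is the projection described above: any gradient component along v is removed before it reaches the unnormalized embedding.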
[2]

Representation capacity: Constraining representations to the unit hypersphere S^{n−1} reduces the effective dimensionality from n to n−1. Releasing this constraint restores the full R^n space, providing additional capacity that may help the model learn better angular structures even if the magnitude dimension itself is not used for relevance encoding.
[3]

Loss landscape smoothing: The normalization operation creates a non-convex mapping that can introduce sharp curvature in the loss landscape. Dot product similarity, being a simple linear operation, may yield a smoother landscape that is easier to optimize. I.3. What Does Magnitude Encode When d < 0? The negative Cohen’s d on in-domain tasks raises an intere...
[4]

Sample 1000 query-document pairs from the validation set.

[5]

For each pair, compute the partial derivatives using the trained model’s embeddings.

[6]

Square each derivative and average across all pairs.

[7]

Report values in log10 scale for readability. The ⋆ markers in Table 2 indicate the best-performing similarity functions for each model, allowing comparison between gradient sensitivity patterns and empirical performance. M.4. FIM Condition Number for Predicting QNorm vs DNorm. This section provides the theoretical foundation for the FIM condition number pred...
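
The four steps in entries [4] to [7] can be sketched as follows. The excerpt does not say which function is differentiated or with respect to what, so this sketch makes loud assumptions: it differentiates the similarity score with respect to the query and document embedding coordinates, uses central finite differences, and substitutes synthetic Gaussian embeddings for a trained model's validation embeddings. All names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def score(q, d, normalize_query, normalize_doc):
    """Similarity for one cell of the 2x2 grid (assumed form)."""
    q_side = normalize(q) if normalize_query else q
    d_side = normalize(d) if normalize_doc else d
    return dot(q_side, d_side)

def squared_grad_norm(f, x, eps=1e-5):
    """Sum of squared central-difference partial derivatives of f at x."""
    total = 0.0
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += eps
        minus[i] -= eps
        total += ((f(plus) - f(minus)) / (2 * eps)) ** 2
    return total

def log10_sensitivity(pairs, normalize_query, normalize_doc):
    """Steps 2-4: per-pair derivatives, squared, averaged, reported in log10."""
    q_total = d_total = 0.0
    for q, d in pairs:
        q_total += squared_grad_norm(
            lambda x: score(x, d, normalize_query, normalize_doc), q)
        d_total += squared_grad_norm(
            lambda x: score(q, x, normalize_query, normalize_doc), d)
    return math.log10(q_total / len(pairs)), math.log10(d_total / len(pairs))

# Step 1 stand-in: 1000 synthetic pairs. Documents are deliberately given a
# larger typical magnitude than queries so the two sides differ.
rng = random.Random(0)
DIM = 8
pairs = [([rng.gauss(0.0, 1.0) for _ in range(DIM)],
          [rng.gauss(0.0, 3.0) for _ in range(DIM)])
         for _ in range(1000)]

dot_q, dot_d = log10_sensitivity(pairs, False, False)
cos_q, cos_d = log10_sensitivity(pairs, True, True)
print("Dot:    query-side", dot_q, "doc-side", dot_d)
print("Cosine: query-side", cos_q, "doc-side", cos_d)
```

With an unnormalized dot product the query-side sensitivity equals the mean squared document norm, so large documents make the score more sensitive to the query. With a real model, the pairs would come from its validation embeddings and the differentiated quantity should follow the paper's own definition.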
[8]

QNorm and DNorm have identical effective dimensions (2d−1), so κ is comparable.

[9]

Smaller κ means more balanced loss landscape curvature.

[10]

More balanced curvature leads to easier optimization and better convergence.

[11]

Better optimization typically leads to better final performance. Limitations: This method predicts optimization ease, not final performance directly. The prediction assumes that optimization difficulty is the primary factor distinguishing QNorm from DNorm for a given model, which holds empirically but may not hold in all scenarios. N. Extended Experimen...
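
The link between κ and optimization ease claimed in entries [9] and [10] can be illustrated on a toy problem. The sketch below is not the paper's FIM computation: it assumes a diagonal quadratic loss whose curvatures stand in for the FIM eigenvalues, and runs plain gradient descent with step size 1/λ_max. The function names and curvature values are hypothetical.

```python
def condition_number(curvatures):
    """kappa = largest curvature / smallest curvature."""
    return max(curvatures) / min(curvatures)

def steps_to_converge(curvatures, tol=1e-6, max_steps=1_000_000):
    """Gradient descent on f(x) = 0.5 * sum(c_i * x_i**2) from x = (1, ..., 1).

    Uses step size 1 / max(curvatures) and returns the number of updates
    needed before every coordinate is within tol of the minimum at zero.
    """
    step = 1.0 / max(curvatures)
    x = [1.0] * len(curvatures)
    for n in range(max_steps):
        if all(abs(xi) < tol for xi in x):
            return n
        # The gradient of f with respect to x_i is c_i * x_i.
        x = [xi - step * c * xi for xi, c in zip(x, curvatures)]
    return max_steps

balanced = [1.0, 2.0]    # kappa = 2
skewed = [1.0, 100.0]    # kappa = 100

for name, curv in (("balanced", balanced), ("skewed", skewed)):
    print(name, condition_number(curv), steps_to_converge(curv))
```

The low-curvature coordinate contracts by a factor of 1 − 1/κ per step, so the skewed problem needs far more updates. As entry [11] notes, this concerns optimization ease rather than final performance.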
[12]

DNorm emerges as the preferred strategy: When free to choose any normalization level, both models gravitate toward DNorm-like performance, confirming that preserving query magnitude while normalizing documents is the preferred asymmetric strategy.
[13]

Continuous interpolation can match or exceed discrete variants: For Contriever, the learnable variant achieves the 31 Beyond the Unit Hypersphere: Embedding Magnitude in Contrastive Learning Figure 13. Contriever: Document magnitude dynamics during training. Left: Mean magnitude of relevant (positive) documents. Center: Mean magnitude of irrelevant (negative...
[14]

Fine-grained control matters: The learned γ values, though close to the midpoint, represent precise balance points that the model discovers through optimization. This fine-grained control enables performance that matches or exceeds the best discrete variants. O. Per-Dataset Analysis: Cohen’s d and Query CV. Table 25 summarizes Cohen’s d by benchmark category ...
[15]

Radial gradients are eliminated: Any gradient component that would change ∥v∥ is projected out. Even if the loss function would benefit from adjusting magnitude, the optimizer cannot act on this signal.
[16]

Optimization is confined to S^{n−1}: The effective optimization landscape is the (n−1)-dimensional unit hypersphere, not the full n-dimensional space. This reduces representational capacity by one degree of freedom per embedding. 3. Magnitude information is noise: Since gradients cannot systematically adjust magnitude, any magnitude variation in cosine-traine...
[17]

Magnitude encodes relevance: When the positive document d+ has large magnitude, the gradient −d+ pulls q toward high-magnitude regions. Over training, this creates the magnitude-relevance correlation we observe (Cohen’s d > 0).
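
The −d+ term in this passage follows from a standard contrastive loss. The derivation below is a sketch that assumes an InfoNCE loss over unnormalized dot-product scores with temperature τ; the excerpt does not state the exact loss, so the form of L and the symbols p_j are assumptions made for illustration.

```latex
% Assumed loss: InfoNCE with dot-product scores; the sum over j includes d^{+}.
\begin{align*}
\mathcal{L}(q) &= -\log \frac{\exp(q^\top d^{+}/\tau)}{\sum_{j} \exp(q^\top d_j/\tau)},
\qquad
p_j = \frac{\exp(q^\top d_j/\tau)}{\sum_{k} \exp(q^\top d_k/\tau)}, \\
\nabla_q \mathcal{L} &= \frac{1}{\tau}\Bigl(\sum_{j} p_j\, d_j - d^{+}\Bigr).
\end{align*}
```

A descent step therefore moves q along +d+ with a step proportional to ∥d+∥, and away from the probability-weighted average of all candidates. Under this assumed loss, a positive document with larger magnitude exerts a proportionally larger pull on the query.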
[18]

Effective temperature modulation: The softmax probabilities p_j ∝ exp(α q⊤d_j/τ) depend on both direction and magnitude. High-magnitude queries produce sharper distributions, effectively lowering the temperature for “confident” queries. Geometric Interpretation. Geometrically, cosine similarity constrains optimization to move along the surface of the unit sph...
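
The sharpening effect can be seen in a few lines. The sketch below reads α as a scale factor on a unit-direction query, which is an assumption since the excerpt does not define α, and measures sharpness by the entropy of the softmax distribution. The toy documents and function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def retrieval_distribution(alpha, q_unit, docs, tau=1.0):
    """p_j proportional to exp(alpha * q_unit . d_j / tau)."""
    logits = [alpha * sum(a * b for a, b in zip(q_unit, d)) / tau for d in docs]
    return softmax(logits)

q_unit = [1.0, 0.0]                          # query direction, unit norm
docs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]  # three toy candidates

entropies = {}
for alpha in (1.0, 4.0, 16.0):
    p = retrieval_distribution(alpha, q_unit, docs)
    entropies[alpha] = entropy(p)
    print(alpha, [round(x, 3) for x in p], round(entropies[alpha], 3))

# Scaling the query by alpha has the same effect as dividing tau by alpha.
p_scaled = retrieval_distribution(4.0, q_unit, docs, tau=1.0)
p_cooled = retrieval_distribution(1.0, q_unit, docs, tau=0.25)
```

Entropy falls as α grows, and the last two distributions coincide: multiplying the query magnitude by α is equivalent to dividing the temperature by α, which is the effective temperature modulation the passage describes.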
[19]

For asymmetric tasks: Consider removing the unit-norm constraint to allow magnitude learning. The choice among Dot, QNorm, and DNorm should be based on validation performance, as the optimal choice varies by model and task.
[20]

Diagnostic tool: Compute Cohen’s d between relevant and irrelevant candidate magnitudes. If d ≈ 0, magnitude is not being utilized; if d > 0.5, magnitude carries significant relevance signal.
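
A minimal sketch of this diagnostic follows. It assumes the usual pooled standard deviation with sample variances (n − 1 denominators); the excerpt gives only the general form (µ_rel − µ_irrel)/σ_pooled, so the pooling formula, the function names, and the toy embeddings are illustrative assumptions.

```python
import math
import statistics

def magnitudes(embeddings):
    """L2 norm of each embedding."""
    return [math.sqrt(sum(x * x for x in v)) for v in embeddings]

def cohens_d(relevant, irrelevant):
    """(mean_rel - mean_irrel) / pooled standard deviation.

    Pooling uses sample variances weighted by degrees of freedom, which is an
    assumed convention rather than one confirmed by the source.
    """
    n1, n2 = len(relevant), len(irrelevant)
    var1 = statistics.variance(relevant)
    var2 = statistics.variance(irrelevant)
    pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(relevant) - statistics.mean(irrelevant)) / pooled

# Toy candidate embeddings: relevant documents are slightly longer on average.
relevant_docs = [[2.0, 0.0], [0.0, 3.0], [4.0, 0.0]]
irrelevant_docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]

d_value = cohens_d(magnitudes(relevant_docs), magnitudes(irrelevant_docs))
print(d_value)
```

In practice the two lists would hold the norms of relevant and irrelevant candidate embeddings from a trained model, and the resulting value would be read against the thresholds quoted above.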
[21]

Testable prediction: The “relevance counter” mechanism predicts that in recommendation systems, item embeddings trained without normalization should exhibit magnitude correlated with item popularity; we leave this verification to future work. Connection to Representational Flexibility. The tangent space constraint has implications beyond magnitude learning...