Improving Relative Representations with Learned Anchors and Whitened Inner Products

Fabian Mager; Hiba Nassar; Nikolaj Holst Jakobsen; Oscar Thorsted Svendsen

arxiv: 2605.30596 · v1 · pith:4B4KOPXZnew · submitted 2026-05-28 · 💻 cs.LG

Pith reviewed 2026-06-29 08:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords: relative representations, learned anchors, whitened inner products, cross-model communication, zero-shot transfer, model compatibility, neural representations, transformer geometries

The pith

Learned anchors as semantic prototypes and whitened inner products enable nearly lossless cross-model communication via relative representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independently trained neural networks develop incompatible internal spaces that block modular AI systems from sharing knowledge directly. Relative representations address this by expressing each point through its similarities to shared anchors instead of absolute coordinates, yet random anchors and cosine similarity often fail on the anisotropic spaces typical of transformers. The paper replaces random anchors with learned semantic prototypes and substitutes cosine similarity with a whitened inner product that keeps magnitude information while remaining invariant to affine shifts. This change produces large gains in consistency on vision and language tasks and supports nearly lossless information transfer together with stable zero-shot communication even between small language models of different scales.
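The affine-invariance claim can be checked numerically with a minimal sketch. The summary does not give the paper's formula, so the code assumes a whitened inner product of the form s(x, a) = (x − μ)ᵀ Σ⁻¹ (a − μ), with μ and Σ estimated from a sample of the latent space. The function names and the synthetic data are hypothetical, and the second "model" is simulated as an exact invertible affine map of the first, which real pairs of networks only approximate.

```python
import numpy as np

def whitened_inner_product(X, anchors, ref):
    """Relative representation of X w.r.t. anchors under an ASSUMED WIP:
    s(x, a) = (x - mu)^T Sigma^{-1} (a - mu), with mu and Sigma estimated
    from `ref`, a sample drawn from the same latent space."""
    mu = ref.mean(axis=0)
    sigma = np.cov(ref, rowvar=False)
    # Solve Sigma Z = (anchors - mu)^T rather than forming an explicit inverse.
    Z = np.linalg.solve(sigma, (anchors - mu).T)
    return (X - mu) @ Z  # shape (n_points, n_anchors)

def cosine_relative(X, anchors):
    """Baseline relative representation using cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    An = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return Xn @ An.T

rng = np.random.default_rng(0)
d, n, m = 8, 500, 5

# Synthetic anisotropic latent space: per-dimension scales from 0.1 to 5.
latents = rng.normal(size=(n, d)) * np.linspace(0.1, 5.0, d)
anchor_idx = rng.choice(n, size=m, replace=False)  # random anchors, for the demo only

# Simulated second model: a well-conditioned invertible affine map x -> A x + b.
Q1, _ = np.linalg.qr(rng.normal(size=(d, d)))
Q2, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q1 @ np.diag(np.linspace(0.5, 2.0, d)) @ Q2.T
b = rng.normal(size=d)
latents_b = latents @ A.T + b

rel_a = whitened_inner_product(latents, latents[anchor_idx], latents)
rel_b = whitened_inner_product(latents_b, latents_b[anchor_idx], latents_b)
cos_a = cosine_relative(latents, latents[anchor_idx])
cos_b = cosine_relative(latents_b, latents_b[anchor_idx])

print("max WIP difference:   ", np.abs(rel_a - rel_b).max())
print("max cosine difference:", np.abs(cos_a - cos_b).max())
```

Under this assumed form, the two whitened representations agree up to floating-point error because the sample mean and covariance transform along with the data, while the cosine representations differ. Unlike cosine similarity, the score is not normalized by vector length, which is one way to read "keeps magnitude information".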

Core claim

By learning anchors as robust semantic prototypes and employing a geometry-aware whitened inner product similarity metric that preserves magnitude information and remains invariant to affine shifts, relative representations can achieve significant performance gains and enable nearly lossless information transfer and stable zero-shot communication between highly heterogeneous neural architectures such as small language models of varying scales.

What carries the argument

Learned semantic prototype anchors paired with whitened inner products for similarity measurement.

If this is right

Significant gains in performance and consistency across vision and language tasks.
Nearly lossless information transfer between independently trained models.
Stable zero-shot communication between highly heterogeneous architectures such as small language models of varying scales.
Improved handling of anisotropic geometries found in modern transformer models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the method works, independently trained modules could be assembled into larger systems without separate alignment training.
The same anchor-learning and metric changes might apply to modalities beyond vision and language.
Scaling the approach to much larger models could test whether the consistency gains persist.

Load-bearing premise

That learning anchors as semantic prototypes and switching to whitened inner products will reliably overcome the anisotropic geometries that defeat random anchors and cosine similarity.

What would settle it

High error rates or unstable zero-shot performance when transferring between small language models of different scales using the learned anchors and whitened inner products.

Figures

Figures reproduced from arXiv: 2605.30596 by Fabian Mager, Hiba Nassar, Nikolaj Holst Jakobsen, Oscar Thorsted Svendsen.

**Figure 2.** Impact of s(·, ·) on RR geometry. Euclidean-distance-based measures tend to create distorted spaces heavily affected by the distance between the anchors. Cosine similarity results in a warped space where all datapoints lie on an approximately elliptical shell (when d ≤ m). In contrast, WIP yields a more consistent cluster geometry across the deformed embedding spaces.

**Figure 3.** Ablation across anchor counts for anchor construction (random vs. …

**Figure 4.** Effect of support-set size (|Xsub|) on CIFAR-100 zero-shot performance. After roughly 5,000–10,000 parallel points (≈10–20% of the dataset), gains diminish substantially, with performance saturating as more points are added.

**Figure 5.** MNIST reconstruction with an anisotropic latent space. The bottom two rows are zero-shot stitching using RR.

Original abstract

Independently trained neural models typically converge to incompatible latent representations, creating a fundamental barrier to highly modular AI systems. While Relative Representations (RR) address this by mapping absolute coordinates to a shared space defined by similarities to common anchor points, traditional implementations rely on randomly sampled anchors and cosine similarity, which frequently fail to capture the anisotropic geometries of modern architectures like Transformers. In this work, we propose a robust framework for cross-model communication based on two improvements. We learn anchors as robust semantic prototypes and utilize a geometry-aware similarity metric which preserves discriminative magnitude information and is invariant to affine shifts. Our approach demonstrates significant gains in performance and consistency across vision and language tasks. Notably, it enables nearly lossless information transfer and stable zero-shot communication even between highly heterogeneous architectures, such as small language models of varying scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Incremental fix to relative representations via learned anchors and whitened inner products, but the abstract gives no numbers or experiment details to back the nearly lossless transfer claim.

The main thing to know is that this extends relative representations by swapping random anchors for learned semantic prototypes and cosine similarity for a whitened inner product. The abstract says this handles anisotropic geometries in transformers better and delivers big gains plus nearly lossless zero-shot transfer across heterogeneous models.

The changes are concrete. Learned anchors make sense as a way to pick more stable reference points instead of hoping random samples work. The whitened metric is meant to keep magnitude information and stay invariant to affine shifts, which addresses a real limitation of standard cosine on modern embeddings. The paper lays out the motivation clearly against the traditional random-anchor baseline.
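The stability argument for prototype-style anchors can be illustrated with a small sketch. It assumes, purely for illustration, that a prototype is the mean latent of one labelled class in a shared support set; the abstract does not say how the paper learns its anchors, so `class_prototypes` and the synthetic data below are hypothetical stand-ins rather than the authors' procedure.

```python
import numpy as np

def class_prototypes(latents, labels):
    """Return one anchor per class: the mean latent of that class.
    ASSUMPTION: class means stand in for 'learned semantic prototypes'."""
    classes = np.unique(labels)
    protos = np.stack([latents[labels == c].mean(axis=0) for c in classes])
    return classes, protos

rng = np.random.default_rng(0)
n_classes, per_class, d = 4, 50, 6

# Synthetic support set: each class is a noisy cloud around a true center.
centers = rng.normal(scale=5.0, size=(n_classes, d))
labels = np.repeat(np.arange(n_classes), per_class)
support = centers[labels] + rng.normal(size=(labels.size, d))

classes, anchors = class_prototypes(support, labels)

# Baseline: a single sample per class used directly as the anchor.
random_anchors = np.stack([support[labels == c][0] for c in classes])

# Average distance of each anchor type from the true class centers.
proto_err = np.linalg.norm(anchors - centers, axis=1).mean()
random_err = np.linalg.norm(random_anchors - centers, axis=1).mean()
print(f"prototype anchors: {proto_err:.3f}   single-sample anchors: {random_err:.3f}")
```

In this toy setting, averaging suppresses per-sample noise, so the prototype anchors sit closer to the class centers than single samples do. Whether the same holds for the paper's learned anchors is something only its experiments can show.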

The soft spot is obvious from the abstract alone: it asserts significant performance gains and stable communication between small language models of varying scales, yet shows no metrics, no baselines, no statistical tests, and no description of the vision or language experiments. Without those, the central claim cannot be checked. The stress-test found no internal inconsistency in the setup, but that does not substitute for evidence.

This is for researchers working on model interoperability and modular systems who already know the relative representations literature. Someone looking for a practical tweak to alignment methods might pick up the anchor-learning idea or the whitened metric, but the lack of results limits how far it moves the needle.

The math and construction look standard and non-circular based on the description. If the full paper contains reproducible experiments that actually measure the transfer, it would be worth referee time because the problem it targets is practical. Based on the abstract, though, the evidence is too thin to judge yet.

Referee Report

1 major / 0 minor

Summary. The paper claims that independently trained neural models produce incompatible latent representations, and that Relative Representations can be improved by learning anchors as semantic prototypes and replacing cosine similarity with a whitened inner-product metric that preserves magnitude and is invariant to affine shifts. These changes are asserted to yield significant gains in performance and consistency on vision and language tasks, enabling nearly lossless information transfer and stable zero-shot communication even between highly heterogeneous architectures such as small language models of varying scales.

Significance. If the empirical results hold with proper controls and metrics, the work could meaningfully advance modular AI by reducing the barrier of representation incompatibility. The geometry-aware similarity addresses a known limitation of standard RR implementations on anisotropic spaces such as those produced by Transformers.

major comments (1)

[Abstract] The central claims of 'significant gains in performance and consistency' and 'nearly lossless information transfer' are stated without any quantitative metrics, baselines, statistical tests, or experimental details. This prevents assessment of whether the reported improvements are load-bearing or merely incremental.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater specificity in the abstract. We agree that the current abstract is too high-level and will revise it to incorporate key quantitative results from our experiments.

Referee: [Abstract] The central claims of 'significant gains in performance and consistency' and 'nearly lossless information transfer' are stated without any quantitative metrics, baselines, statistical tests, or experimental details. This prevents assessment of whether the reported improvements are load-bearing or merely incremental.

Authors: We agree with this observation. While the body of the manuscript contains detailed experimental results with metrics, baselines, and comparisons across vision and language tasks, the abstract does not reference any specific numbers. In the revised manuscript we will update the abstract to include concrete performance figures (e.g., accuracy or transfer fidelity on representative benchmarks) along with a brief mention of the evaluation protocol, thereby allowing readers to assess the magnitude of the reported gains. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation

The manuscript proposes an empirical improvement to relative representations via learned anchors and whitened inner-product similarity, then reports performance gains on vision and language tasks. No mathematical derivation chain, equations, or self-citations are presented that reduce the claimed gains or zero-shot transfer results to quantities defined by construction from fitted parameters, prior self-referential normalizations, or load-bearing self-citations. The central claims rest on experimental outcomes rather than any self-definitional or fitted-input-called-prediction pattern, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the central claim rests on the premise that random anchors and cosine similarity are inadequate for anisotropic geometries, but no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5673 in / 1093 out tokens · 31998 ms · 2026-06-29T08:19:48.390153+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 11 canonical work pages · 5 internal anchors

[1] Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In International Conference on Learning Representations, 2023. arXiv:2209.15430

[2] Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, and Erik Bekkers. On the importance of embedding norms in self-supervised learning, 2025. arXiv:2502.09252

[3] Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019. arXiv:1909.00512

[4] Nathan Godey, Éric de la Clergerie, and Benoît Sagot. Anisotropy is inherent to self-attention in transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics,

[5] Valentino Maiorca, Luca Moschella, Marco Fumero, Francesco Locatello, and Emanuele Rodolà. Latent space translation via inverse relative projection, 2024. arXiv:2406.15057

[6] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, 2019. arXiv:1905.00414

[7] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis, 2024. arXiv:2405.07987

[8] Tomás Hüttebräucker, Simone Fiorellino, Mohamed Sana, Paolo Di Lorenzo, and Emilio Calvanese Strinati. Relative representations of latent spaces enable efficient semantic channel equalization, 2024. arXiv:2411.19719

[9] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding,

[10] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. arXiv:2010.11929

[12] Dongyoon Han, Sangdoo Yun, Byeongho Heo, and YoungJoon Yoo. Rethinking channel dimensions for efficient model design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. arXiv:2007.00992

[13] Phillip Keung, Yichao Lu, György Szarvas, and Noah A. Smith. The multilingual Amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020. arXiv:2010.02573

[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019. arXiv:1810.04805
