arxiv: 2605.30596 · v1 · pith:4B4KOPXZnew · submitted 2026-05-28 · 💻 cs.LG

Improving Relative Representations with Learned Anchors and Whitened Inner Products

Pith reviewed 2026-06-29 08:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords relative representations · learned anchors · whitened inner products · cross-model communication · zero-shot transfer · model compatibility · neural representations · transformer geometries

The pith

Learned anchors as semantic prototypes and whitened inner products enable nearly lossless cross-model communication via relative representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Independently trained neural networks develop incompatible internal spaces that block modular AI systems from sharing knowledge directly. Relative representations address this by expressing each point through its similarities to shared anchors instead of absolute coordinates, yet random anchors and cosine similarity often fail on the anisotropic spaces typical of transformers. The paper replaces random anchors with learned semantic prototypes and substitutes cosine similarity with a whitened inner product that keeps magnitude information while remaining invariant to affine shifts. This change produces large gains in consistency on vision and language tasks and supports nearly lossless information transfer together with stable zero-shot communication even between small language models of different scales.
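The baseline described above can be illustrated with a minimal sketch, assuming NumPy and toy Gaussian embeddings. The function name `cosine_relative`, the toy data, and the two simulated "models" are illustrative assumptions, not taken from the paper. The sketch builds relative representations from randomly sampled anchors and cosine similarity, then checks that they survive a rotation plus global rescaling but change under an anisotropic stretch with a shift.

```python
import numpy as np

def cosine_relative(x, anchors):
    """Relative representation: cosine similarity of each row of x to each anchor."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    a_n = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return x_n @ a_n.T  # shape (n_points, n_anchors)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))                          # toy "absolute" embeddings
anchor_idx = rng.choice(200, size=16, replace=False)   # randomly sampled anchors

# Hypothetical second model: same points after a rotation and a global rescale.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
x_rot = 3.0 * x @ q

# Hypothetical third model: anisotropic stretch plus a constant shift.
stretch = np.diag(np.linspace(0.2, 5.0, 8))
x_aniso = x @ stretch + 4.0

rr = cosine_relative(x, x[anchor_idx])
rr_rot = cosine_relative(x_rot, x_rot[anchor_idx])
rr_aniso = cosine_relative(x_aniso, x_aniso[anchor_idx])

err_rot = np.abs(rr - rr_rot).max()      # close to zero: cosine RR is preserved
err_aniso = np.abs(rr - rr_aniso).max()  # large: cosine RR is distorted
print(f"rotation+scale error: {err_rot:.2e}, anisotropic error: {err_aniso:.2f}")
```

The stretch and shift here are a crude stand-in for the anisotropy the summary attributes to transformers; real latent spaces differ in more complicated ways.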

Core claim

By learning anchors as robust semantic prototypes and employing a geometry-aware whitened inner product similarity metric that preserves magnitude information and remains invariant to affine shifts, relative representations can achieve significant performance gains and enable nearly lossless information transfer and stable zero-shot communication between highly heterogeneous neural architectures such as small language models of varying scales.
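The summary does not say how the anchors are learned. As a rough intuition for why prototype-style anchors can be more robust than randomly sampled points, the sketch below uses per-class mean embeddings as a simple stand-in. This is an assumption for illustration, not the paper's procedure, and the name `class_mean_prototypes` and the toy data are hypothetical.

```python
import numpy as np

def class_mean_prototypes(embeddings, labels):
    """Return one anchor per class: the mean embedding of that class.

    Stand-in for "learned semantic prototypes"; the paper's actual
    anchor-learning procedure is not described in this summary.
    """
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

rng = np.random.default_rng(0)
n_classes, per_class, dim = 5, 100, 16
centers = rng.normal(scale=5.0, size=(n_classes, dim))   # true class centers
labels = np.repeat(np.arange(n_classes), per_class)
emb = centers[labels] + rng.normal(size=(n_classes * per_class, dim))

classes, prototypes = class_mean_prototypes(emb, labels)

# Random anchors for comparison: individual samples carry per-sample noise.
rand_idx = rng.choice(len(emb), size=n_classes, replace=False)
random_anchors = emb[rand_idx]

# Average distance of each anchor to the center of the class it came from.
proto_err = np.linalg.norm(prototypes - centers[classes], axis=1).mean()
random_err = np.linalg.norm(random_anchors - centers[labels[rand_idx]], axis=1).mean()
print(f"prototype error: {proto_err:.2f}, random-anchor error: {random_err:.2f}")
```

Averaging over many samples shrinks the noise in each anchor, which is one plausible reason prototypes give a more stable reference frame than single sampled points.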

What carries the argument

Learned semantic prototype anchors paired with whitened inner products for similarity measurement.
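One common way to write a whitened inner product is s(x, a) = (x − μ)ᵀ Σ⁻¹ (a − μ), with μ and Σ the mean and covariance of a reference set. The sketch below assumes that form; the paper's exact definition is not given in this summary, so the function `whitened_inner_product`, its `eps` regularizer, and the toy data are all assumptions. It checks the two properties named in the core claim: invariance to an invertible affine map, and sensitivity to magnitude.

```python
import numpy as np

def whitened_inner_product(x, anchors, ref, eps=0.0):
    """s(x, a) = (x - mu)^T Sigma^{-1} (a - mu), with mu, Sigma estimated from `ref`.

    `eps` is a hypothetical ridge term for ill-conditioned covariances.
    """
    mu = ref.mean(axis=0)
    cov = np.cov(ref, rowvar=False) + eps * np.eye(ref.shape[1])
    xc = x - mu
    ac = anchors - mu
    # Solve a linear system instead of forming the explicit inverse.
    return xc @ np.linalg.solve(cov, ac.T)  # shape (n_points, n_anchors)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 6))

# Hypothetical second space: an invertible affine map y = A x + b.
q1, _ = np.linalg.qr(rng.normal(size=(6, 6)))
q2, _ = np.linalg.qr(rng.normal(size=(6, 6)))
a_map = q1 @ np.diag(np.linspace(0.5, 3.0, 6)) @ q2.T
b = rng.normal(size=6)
y = x @ a_map.T + b

rel_x = whitened_inner_product(x, x[:10], ref=x)
rel_y = whitened_inner_product(y, y[:10], ref=y)   # same points, other space

# Magnitude is kept: doubling a point's offset from the mean doubles its scores.
mu = x.mean(axis=0)
rel_scaled = whitened_inner_product(mu + 2.0 * (x - mu), x[:10], ref=x)

print(np.allclose(rel_x, rel_y, atol=1e-6), np.allclose(rel_scaled, 2.0 * rel_x))
```

The invariance follows because the affine map sends Σ to AΣAᵀ and the centered vectors to A(x − μ), so the factors of A cancel inside the product. Cosine similarity has neither property: it changes under anisotropic maps and discards the norm.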

If this is right

  • Significant gains in performance and consistency across vision and language tasks.
  • Nearly lossless information transfer between independently trained models.
  • Stable zero-shot communication between highly heterogeneous architectures such as small language models of varying scales.
  • Improved handling of anisotropic geometries found in modern transformer models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method works, independently trained modules could be assembled into larger systems without separate alignment training.
  • The same anchor-learning and metric changes might apply to modalities beyond vision and language.
  • Scaling the approach to much larger models could test whether the consistency gains persist.

Load-bearing premise

That learning anchors as semantic prototypes and switching to whitened inner products will reliably overcome the anisotropic geometries that defeat random anchors and cosine similarity.

What would settle it

High error rates or unstable zero-shot performance when transferring between small language models of different scales using the learned anchors and whitened inner products.
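A check of this kind can be sketched on toy data, assuming NumPy. The consistency score used here (mean cosine similarity between matching rows of two relative representations) and all function names are illustrative assumptions, not the paper's evaluation protocol. The two simulated "models" differ by an exact affine map, which is far more benign than real cross-model differences.

```python
import numpy as np

def cosine_rr(x, anchors):
    """Relative representation using cosine similarity."""
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    a_n = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return x_n @ a_n.T

def wip_rr(x, anchors):
    """Relative representation using an assumed whitened inner product."""
    mu = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    return (x - mu) @ np.linalg.solve(cov, (anchors - mu).T)

def rowwise_cosine(a, b):
    """Mean cosine similarity between matching rows of two representations."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(300, 8))

# Two toy "models" related by a hypothetical anisotropic affine map.
q1, _ = np.linalg.qr(rng.normal(size=(8, 8)))
q2, _ = np.linalg.qr(rng.normal(size=(8, 8)))
a_map = q1 @ np.diag(np.linspace(0.5, 3.0, 8)) @ q2.T
model_a = z
model_b = z @ a_map.T + 3.0

idx = np.arange(20)  # shared anchor points, identified by index in both spaces
cos_score = rowwise_cosine(cosine_rr(model_a, model_a[idx]),
                           cosine_rr(model_b, model_b[idx]))
wip_score = rowwise_cosine(wip_rr(model_a, model_a[idx]),
                           wip_rr(model_b, model_b[idx]))
print(f"cosine consistency: {cos_score:.3f}, WIP consistency: {wip_score:.6f}")
```

On real models the spaces are not exactly affinely related, so a score well below 1 for the whitened variant, or large swings across model pairs, would be the kind of evidence that counts against the claim.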

Figures

Figures reproduced from arXiv: 2605.30596 by Fabian Mager, Hiba Nassar, Nikolaj Holst Jakobsen, Oscar Thorsted Svendsen.

Figure 1: Shows conditions that may arise using cosine similarity as …
Figure 2: Demonstrates the impact of s(·, ·) on RR geometry. Euclidean-distance-based measures tend to create distorted spaces heavily affected by the distance between the anchors. Cosine similarity results in a warped space where all datapoints lie on an approximately elliptical shell (when d ≤ m). In contrast, WIP yields a more consistent cluster geometry across the deformed embedding spaces.
Figure 3: Ablation across anchor counts for anchor construction (random vs. …
Figure 4: Effect of support-set size (|X_sub|) on CIFAR-100 zero-shot performance. After roughly 5,000–10,000 parallel points (≈10–20% of the dataset), gains diminish substantially, with performance saturating as more points are added.
Figure 5: MNIST reconstruction with an anisotropic latent space. The bottom 2 rows are zero-shot stitching using RR.

Original abstract

Independently trained neural models typically converge to incompatible latent representations, creating a fundamental barrier to highly modular AI systems. While Relative Representations (RR) address this by mapping absolute coordinates to a shared space defined by similarities to common anchor points, traditional implementations rely on randomly sampled anchors and cosine similarity, which frequently fail to capture the anisotropic geometries of modern architectures like Transformers. In this work, we propose a robust framework for cross-model communication based on two improvements. We learn anchors as robust semantic prototypes and utilize a geometry-aware similarity metric which preserves discriminative magnitude information and is invariant to affine shifts. Our approach demonstrates significant gains in performance and consistency across vision and language tasks. Notably, it enables nearly lossless information transfer and stable zero-shot communication even between highly heterogeneous architectures, such as small language models of varying scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that independently trained neural models produce incompatible latent representations, and that Relative Representations can be improved by learning anchors as semantic prototypes and replacing cosine similarity with a whitened inner-product metric that preserves magnitude and is invariant to affine shifts. These changes are asserted to yield significant gains in performance and consistency on vision and language tasks, enabling nearly lossless information transfer and stable zero-shot communication even between highly heterogeneous architectures such as small language models of varying scales.

Significance. If the empirical results hold with proper controls and metrics, the work could meaningfully advance modular AI by reducing the barrier of representation incompatibility. The geometry-aware similarity addresses a known limitation of standard RR implementations on anisotropic spaces such as those produced by Transformers.

major comments (1)
  1. [Abstract] The central claims of 'significant gains in performance and consistency' and 'nearly lossless information transfer' are stated without any quantitative metrics, baselines, statistical tests, or experimental details. This prevents assessment of whether the reported improvements are load-bearing or merely incremental.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater specificity in the abstract. We agree that the current abstract is too high-level and will revise it to incorporate key quantitative results from our experiments.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'significant gains in performance and consistency' and 'nearly lossless information transfer' are stated without any quantitative metrics, baselines, statistical tests, or experimental details. This prevents assessment of whether the reported improvements are load-bearing or merely incremental.

    Authors: We agree with this observation. While the body of the manuscript contains detailed experimental results with metrics, baselines, and comparisons across vision and language tasks, the abstract does not reference any specific numbers. In the revised manuscript we will update the abstract to include concrete performance figures (e.g., accuracy or transfer fidelity on representative benchmarks) along with a brief mention of the evaluation protocol, thereby allowing readers to assess the magnitude of the reported gains.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation

full rationale

The manuscript proposes an empirical improvement to relative representations via learned anchors and whitened inner-product similarity, then reports performance gains on vision and language tasks. No mathematical derivation chain, equations, or self-citations are presented that reduce the claimed gains or zero-shot transfer results to quantities defined by construction from fitted parameters, prior self-referential normalizations, or load-bearing self-citations. The central claims rest on experimental outcomes rather than any self-definitional or fitted-input-called-prediction pattern, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the central claim rests on the premise that random anchors and cosine similarity are inadequate for anisotropic geometries, but no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.1-grok · 5673 in / 1093 out tokens · 31998 ms · 2026-06-29T08:19:48.390153+00:00 · methodology
