pith. machine review for the scientific record.

arxiv: 2604.08579 · v1 · submitted 2026-03-28 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link · Lean Theorem

On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords cross-modal alignment · functional maps · spectral geometry · multimodal representations · Laplacian eigenbases · eigenvector alignment · representation manifolds

The pith

Independently trained vision and language encoders develop manifolds of similar complexity but with unaligned eigenvector bases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies the functional map framework from computational geometry to compare the representation manifolds of a pretrained vision encoder and a pretrained language encoder. It establishes that the Laplacian eigenvalue spectra of the two models are close, with a normalized spectral distance of 0.043, showing that they capture comparable intrinsic complexity. At the same time, the functional map between their eigenbases shows near-zero diagonal dominance (mean below 0.05) and a large orthogonality error of 70.15, indicating that the bases are unaligned. The authors name this mismatch the spectral complexity-orientation gap and introduce three diagnostic quantities to measure cross-modal compatibility. The gap supplies a boundary condition that explains the limited performance of spectral alignment techniques relative to Procrustes and relative representations.
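To pin down what these numbers measure, here is a minimal sketch of the three diagnostics applied to a square functional map C and the two Laplacian spectra. The normalization choices (relative L2 distance between sorted spectra, mean diagonal share of row mass) are assumptions of this page, not formulas quoted from the paper.

```python
import numpy as np

def spectral_distance(evals_x, evals_y):
    """Normalized distance between two Laplacian eigenvalue spectra.
    Assumed form: relative L2 distance between the sorted spectra."""
    ex, ey = np.sort(evals_x), np.sort(evals_y)
    return float(np.linalg.norm(ex - ey) / np.linalg.norm(ex + ey))

def diagonal_dominance(C):
    """Mean share of each row's mass on the diagonal of a k x k map.
    Near 1 for a well-aligned (diagonal) map, near 0 when unaligned."""
    row_mass = np.abs(C).sum(axis=1) + 1e-12
    return float(np.mean(np.abs(np.diag(C)) / row_mass))

def orthogonality_error(C):
    """Frobenius deviation of C from an orthonormal operator."""
    return float(np.linalg.norm(C.T @ C - np.eye(C.shape[0])))

def commutativity_error(C, evals_x, evals_y):
    """||C Lx - Ly C||_F: how far C is from commuting with the Laplacians."""
    Lx, Ly = np.diag(evals_x), np.diag(evals_y)
    return float(np.linalg.norm(C @ Lx - Ly @ C))
```

On the paper's numbers, a spectral distance of 0.043 combined with diagonal dominance below 0.05 and orthogonality error of 70.15 is exactly the decoupling being named: close spectra, incompatible bases.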

Core claim

The central claim is that the Laplacian eigenvalue spectra of independently trained vision and language encoders are quantitatively similar while the eigenvector bases remain effectively unaligned under the functional map operator, a decoupling the authors term the spectral complexity-orientation gap.

What carries the argument

The functional map, a compact linear operator between the graph Laplacian eigenbases of two representation manifolds.
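To make the operator concrete: in the standard construction of Ovsjanikov et al. [13], one fixes k Laplacian eigenvectors per space and estimates the map from n anchor correspondences by least squares. In the notation assumed here (not quoted from the paper), with A, B ∈ R^{k×n} holding the anchors' spectral coordinates in the two eigenbases,

$$
C \;=\; \arg\min_{C \in \mathbb{R}^{k \times k}} \lVert C A - B \rVert_F^2 \;=\; B A^{\dagger}.
$$

A fully compatible pair of manifolds would yield a C that is nearly diagonal and nearly orthonormal; the paper's diagnostics quantify the deviation from precisely that ideal.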

If this is right

  • Spectral alignment methods encounter a boundary condition set by the complexity-orientation gap.
  • The three diagnostics (diagonal dominance, orthogonality deviation, Laplacian commutativity error) characterize cross-modal representation compatibility.
  • Functional maps underperform Procrustes and relative representations for cross-modal retrieval at all supervision budgets (both baselines are sketched just below).
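For context on the two baselines the functional map loses to, a minimal sketch of both, assuming the standard formulations from Schönemann [16] and Moschella et al. [11] rather than the paper's exact training details:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: rotation R minimizing ||X @ R - Y||_F
    over paired anchor embeddings (rows of X and Y)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def relative_representation(Z, anchors):
    """Relative representations: re-express each embedding by its
    cosine similarity to a shared set of anchor embeddings."""
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    An = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + 1e-12)
    return Zn @ An.T
```

Both baselines consume the same supervision (anchor pairs) as the functional map, which is what makes the underperformance claim a controlled comparison rather than an apples-to-oranges one.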

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gap may appear across other modality pairs or training regimes, pointing to a broader property of deep representation spaces.
  • Separate correction for orientation after matching complexity could improve spectral alignment performance.
  • The diagnostics offer a practical way to select or adapt encoders before multimodal training.

Load-bearing premise

The functional map framework applied to these pretrained encoders and graph constructions reveals general structural properties of cross-modal manifolds rather than model-specific artifacts.
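The premise is operational rather than philosophical: every reported number sits downstream of a graph Laplacian built from finitely many sampled embeddings. A minimal sketch of that construction, assuming a cosine kNN graph and the symmetric normalized Laplacian (choices consistent with the rebuttal's stated setup below, not confirmed by the paper itself):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenbasis(Z, k_nn=10, n_evecs=128):
    """Embeddings -> kNN graph -> normalized Laplacian -> leading eigenpairs."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # cosine geometry
    W = kneighbors_graph(Zn, k_nn, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                                 # symmetrize the graph
    L = laplacian(csr_matrix(W), normed=True)
    # which="SM" is simple but slow; shift-invert (sigma=0) is faster at scale.
    evals, evecs = eigsh(L, k=n_evecs, which="SM")
    return evals, evecs
```

Every knob here (k, the metric, the normalization, the spectral truncation) is a place where a model-specific artifact could masquerade as a structural property, which is what the referee's second major comment is about.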

What would settle it

Finding high diagonal dominance and low orthogonality error in the functional map for other pairs of independently trained vision and language encoders would falsify the existence of a general spectral complexity-orientation gap.

Figures

Figures reproduced from arXiv: 2604.08579 by Krisanu Sarkar.

Figure 1. Image-to-text R@1 and R@5 as a function of anchor budget. view at source ↗
Figure 2. Image-to-text recall as a function of spectral dimension. view at source ↗
Figure 3. Spectral diagnostics for the DINOv2–MiniLM encoder pair. Top left: Laplacian eigenvalue spectra are quantitatively similar. view at source ↗
read the original abstract

We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the eigenvector bases are effectively unaligned. We term this decoupling the spectral complexity–orientation gap: models converge in how much structure they capture but not in how they organize it. This gap defines a boundary condition for spectral alignment methods and motivates three diagnostic quantities (diagonal dominance, orthogonality deviation, and Laplacian commutativity error) for characterizing cross-modal representation compatibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper applies the functional map framework to analyze alignment between independently pretrained DINOv2 vision and all-MiniLM-L6-v2 language encoders. It reports quantitatively similar Laplacian eigenvalue spectra (normalized distance 0.043) indicating comparable intrinsic complexity, but near-zero functional-map diagonal dominance (<0.05) and large orthogonality error (70.15) indicating unaligned eigenvector bases; this decoupling is termed the spectral complexity-orientation gap. The framework underperforms Procrustes and relative representations on cross-modal retrieval across supervision budgets, and the gap is positioned as a boundary condition motivating three diagnostic quantities.

Significance. If the reported gap is shown to be a general structural property of cross-modal manifolds rather than an artifact of the specific encoders and graph construction, the work supplies concrete diagnostics (diagonal dominance, orthogonality deviation, Laplacian commutativity error) that could guide when spectral alignment methods are applicable and could motivate new representation-learning objectives that explicitly address basis orientation.

major comments (2)
  1. [Abstract] Abstract and results: the normalized spectral distance of 0.043 and orthogonality error of 70.15 are presented as evidence of a general complexity-orientation gap, yet no null-model baselines (same-modality pairs, random Gaussian manifolds, or shuffled correspondences) or statistical tests are described, leaving open whether these values are distinctive to cross-modal pairs or produced by the chosen kNN graph construction and sampling.
  2. [Methods] Methods/results: the manuscript provides no details on the number of eigenvectors retained, the precise kNN graph construction (including k and distance metric), dataset sampling procedure, or error bars on the reported quantities, which are load-bearing for the quantitative claims that the spectra are 'quantitatively similar' and the bases 'effectively unaligned'.
minor comments (1)
  1. [Abstract] The abstract states that the framework 'underperforms' the baselines but does not report the actual retrieval metrics or margins; a table comparing mAP or recall@K across methods and supervision budgets would strengthen the comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in baselines and methodological transparency that weaken the current presentation of the spectral complexity-orientation gap. We will revise the manuscript to incorporate null-model comparisons and full experimental details, thereby strengthening the evidence that the reported decoupling is characteristic of cross-modal pairs.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results: the normalized spectral distance of 0.043 and orthogonality error of 70.15 are presented as evidence of a general complexity-orientation gap, yet no null-model baselines (same-modality pairs, random Gaussian manifolds, or shuffled correspondences) or statistical tests are described, leaving open whether these values are distinctive to cross-modal pairs or produced by the chosen kNN graph construction and sampling.

    Authors: We agree that the absence of null-model baselines leaves the distinctiveness of the gap open to question. In the revision we will add three controls: (i) same-modality pairs (DINOv2 vs. another vision encoder and all-MiniLM vs. another language encoder), (ii) random Gaussian manifolds matched in dimension and eigenvalue decay, and (iii) shuffled correspondence matrices. We will also report p-values from permutation tests against these null distributions. These additions will demonstrate that the observed normalized spectral distance of 0.043 and orthogonality error of 70.15 are statistically larger than those arising from the kNN construction alone, thereby supporting the claim that the complexity-orientation gap is a cross-modal phenomenon. revision: yes

  2. Referee: [Methods] Methods/results: the manuscript provides no details on the number of eigenvectors retained, the precise kNN graph construction (including k and distance metric), dataset sampling procedure, or error bars on the reported quantities, which are load-bearing for the quantitative claims that the spectra are 'quantitatively similar' and the bases 'effectively unaligned'.

    Authors: We acknowledge that these implementation details are essential for reproducibility and for assessing the robustness of the quantitative claims. The revised manuscript will state that the top 128 eigenvectors are retained, that kNN graphs are built with k=10 using cosine distance on L2-normalized embeddings, that 5,000 points are sampled uniformly from the respective datasets, and that all reported scalars (spectral distance, diagonal dominance, orthogonality error) are means with standard deviations over five independent sampling and graph-construction seeds. These specifications will be placed in a new “Experimental Setup” subsection and will be accompanied by the corresponding error bars in all tables and figures. revision: yes
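Condensing the two responses into one place: a minimal sketch of the promised shuffled-correspondence control, with the stated setup values as defaults. The function names and the one-sided p-value convention are assumptions of this page.

```python
import numpy as np

# Setup values taken verbatim from the responses above.
SETUP = dict(n_eigenvectors=128, knn_k=10, metric="cosine",
             n_points=5000, n_seeds=5)

def fit_functional_map(A, B):
    """Least-squares functional map from anchor spectral coordinates (k x n)."""
    return B @ np.linalg.pinv(A)

def shuffled_null_pvalue(A, B, statistic, n_perm=1000, seed=0):
    """One-sided permutation test: is the observed statistic larger than
    under shuffled anchor correspondences (control (iii) above)?"""
    rng = np.random.default_rng(seed)
    observed = statistic(fit_functional_map(A, B))
    null = np.array([
        statistic(fit_functional_map(A, B[:, rng.permutation(B.shape[1])]))
        for _ in range(n_perm)
    ])
    # Add-one smoothing keeps the p-value away from exactly zero.
    return float((1 + np.sum(null >= observed)) / (n_perm + 1))
```

With statistic set to the orthogonality error from the earlier sketch, this implements control (iii); controls (i) and (ii) reuse the same machinery with different embedding sources.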

Circularity Check

0 steps flagged

No significant circularity; empirical measurements on fixed models

full rationale

The paper applies the external functional map framework to compute Laplacian spectra and correspondence operators on fixed pretrained encoders (DINOv2, all-MiniLM-L6-v2) with kNN graphs. Reported values (normalized spectral distance 0.043, diagonal dominance <0.05, orthogonality error 70.15) are direct outputs of these definitions with no fitted parameters, no self-citation of uniqueness theorems, and no ansatz smuggled in. The 'spectral complexity--orientation gap' is a post-hoc descriptive label for the observed quantities rather than a derived claim that reduces to its inputs by construction. The derivation chain is self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central observations rest on the domain assumption that graph Laplacians computed from sampled embeddings faithfully approximate the intrinsic geometry of the representation manifolds; no free parameters are reported as fitted to the gap itself.

axioms (1)
  • domain assumption Representation manifolds of neural encoders can be approximated by graph Laplacians derived from finite samples of embeddings.
    Standard assumption when applying spectral geometry to high-dimensional data embeddings.
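The standard backing for this assumption is the spectral-consistency line of work cited at [18, 19]: under suitable sampling and bandwidth conditions, graph Laplacians converge to a continuous Laplace operator on the manifold. Schematically, with all normalization constants folded into c_n and c,

$$
\frac{1}{c_n}\,(L_n f)(x) \;\xrightarrow[\,n \to \infty\,]{}\; c\,\Delta_{\mathcal{M}} f(x),
$$

so graph eigenpairs can be read as approximations of the manifold's spectral geometry. How well the asymptotics have kicked in at k = 10 neighbors and 5,000 samples is precisely what the referee's second major comment presses on.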

pith-pipeline@v0.9.0 · 5490 in / 1406 out tokens · 60018 ms · 2026-05-14T23:16:41.855206+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    "The Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043)... functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15)"

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1] Yamini Bansal, Preetum Nakkiran, and Boaz Barak. 2021. Revisiting Model Stitching to Compare Neural Representations. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34. Curran Associates, Inc., Red Hook, NY, USA, 225–236.

  2. [2] Mikhail Belkin and Partha Niyogi. 2003. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15, 6 (2003), 1373–1396.

  3. [3] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word Translation Without Parallel Data. In Proceedings of the 6th International Conference on Learning Representations (ICLR). OpenReview.net, Vancouver, Canada, 1–14.

  4. [4] Nicolas Donati, Abhishek Sharma, and Maks Ovsjanikov. 2020. Deep Geometric Functional Maps: Robust Feature Learning for Shape Correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Seattle, WA, USA, 8592–8601.

  5. [5] Harold Hotelling. 1936. Relations Between Two Sets of Variates. Biometrika 28, 3/4 (1936), 321–377.

  6. [6] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. 2024. Position: The Platonic Representation Hypothesis. In Proceedings of the 41st International Conference on Machine Learning (ICML) (Proceedings of Machine Learning Research, Vol. 235). PMLR, Vienna, Austria, 20617–20642.

  7. [7] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of Neural Network Representations Revisited. In Proceedings of the 36th International Conference on Machine Learning (ICML). PMLR, Long Beach, CA, USA, 3519–3529.

  8. [8] Or Litany, Tal Remez, Emanuele Rodolà, Alex M. Bronstein, and Michael M. Bronstein. 2017. Deep Functional Maps: Structured Prediction for Dense Shape Correspondence. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, Venice, Italy, 5659–5667.

  9. [9] Simone Melzi, Jing Ren, Emanuele Rodolà, Abhishek Sharma, Peter Wonka, and Maks Ovsjanikov. 2019. ZoomOut: Spectral Upsampling for Efficient Shape Correspondence. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.

  10. [10] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting Similarities among Languages for Machine Translation. arXiv preprint arXiv:1309.4168 (2013), 1–10.

  11. [11] Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. 2023. Relative Representations Enable Zero-Shot Latent Space Communication. In Proceedings of the 11th International Conference on Learning Representations (ICLR). OpenReview.net, Kigali, Rwanda, 1–27.

  12. [12] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Lab...

  13. [13] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. 2012. Functional Maps: A Flexible Representation of Maps Between Shapes. ACM Transactions on Graphics (TOG) 31, 4 (2012), 1–11.

  14. [14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML). PMLR, Virtual, ...

  15. [15] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Hong Kong, China, 3982–3992.

  16. [16] Peter H. Schönemann. 1966. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31, 1 (1966), 1–10.

  17. [17] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. 2009. A Concise and Provably Informative Multi-Scale Signature Based on Heat Diffusion. In Proceedings of the Symposium on Geometry Processing (SGP). Eurographics Association, Aire-la-Ville, Switzerland, 1383–1392.

  18. [18] Ulrike von Luxburg. 2007. A Tutorial on Spectral Clustering. Statistics and Computing 17, 4 (2007), 395–416.

  19. [19] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. 2008. Consistency of Spectral Clustering. The Annals of Statistics 36, 2 (2008), 555–586.

  20. [20] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. Transactions of the Association for Computational Linguistics (TACL) 2 (2014), 67–78.