SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Paul Krzakala; Quentin Bouniot; Simon Roschmann; Sonia Mazelet; Zeynep Akata

arxiv: 2602.23353 · v2 · pith:YS4AO72Qnew · submitted 2026-02-26 · cs.LG · cs.AI

Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

keywords: alignment, sotalign, semi-supervised, unpaired, data, language, models, paired

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, and then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines. Code is available at https://github.com/ExplainableML/SOTAlign.
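
The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the ridge-regression teacher, the cosine cost, the entropic Sinkhorn solver, the synthetic embeddings, and every name below (`fit_linear_teacher`, `sinkhorn_plan`, `cosine_cost`) are assumptions chosen for illustration. The sketch fits a linear map on a few paired embeddings, then computes a soft transport plan between unpaired image and text embeddings, which could serve as a relational target in a refinement stage.

```python
import numpy as np

def fit_linear_teacher(x_paired, y_paired, ridge=1e-3):
    """Ridge least-squares map W with x_paired @ W ~ y_paired (assumed teacher)."""
    d = x_paired.shape[1]
    gram = x_paired.T @ x_paired + ridge * np.eye(d)
    return np.linalg.solve(gram, x_paired.T @ y_paired)

def cosine_cost(p, q):
    """Pairwise cosine distance between the rows of p and the rows of q."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    return 1.0 - p @ q.T

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic optimal-transport plan between two uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

# Synthetic stand-ins for frozen encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 12))

def fake_text_embeddings(images):
    # Text embeddings modelled as a noisy linear image of the image embeddings.
    return images @ true_map + 0.05 * rng.normal(size=(images.shape[0], 12))

# Stage 1: coarse geometry from a small number of paired samples.
x_paired = rng.normal(size=(32, 16))
y_paired = fake_text_embeddings(x_paired)
W = fit_linear_teacher(x_paired, y_paired)

# Stage 2 (sketch): soft correspondences between unpaired images and texts.
x_unpaired = rng.normal(size=(64, 16))
y_unpaired = fake_text_embeddings(rng.normal(size=(48, 16)))
plan = sinkhorn_plan(cosine_cost(x_unpaired @ W, y_unpaired))
```

Here `plan[i, j]` is the mass transported from unpaired image `i` to unpaired text `j`. In a full pipeline, trainable alignment layers would be optimized against a divergence built from such a plan; the exact divergence used by SOTAlign is defined in the paper and the linked repository, not in this sketch.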

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Topology-Aware Representation Alignment for Semi-Supervised Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

ToMA uses persistent homology on H0-death and lightweight H1-birth edges to align multimodal manifolds, delivering stable gains on remote sensing and consistent benefits on fashion retrieval.

MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification
cs.LG 2026-05 conditional novelty 5.0

MSAlign aligns frozen DreaMS and ChemBERTa models with MLPs and candidate-based contrastive learning to outperform prior methods on molecule retrieval from MS/MS spectra while quantifying distribution shift in data splits.