Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion
Pith reviewed 2026-05-08 12:57 UTC · model grok-4.3
The pith
Stimuli on which vision models closely agree produce up to twice the cross-modal alignment with language models compared with stimuli on which they disagree.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors apply the Generalized Procrustes Algorithm to activations from distinct vision models on individual stimuli to compute intra-modal dispersion. Stimuli with low dispersion, meaning high agreement among the vision models, then produce markedly higher alignment with language model representations than high-dispersion stimuli, an effect that reaches a factor of two in pairings such as DINOv2 with language models and remains stable under different selection criteria.
What carries the argument
The Generalized Procrustes Algorithm applied to single-stimulus activations from multiple vision models, which quantifies intra-modal dispersion and predicts the degree of cross-modal convergence with language models.
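The paper's measurement pipeline is not reproduced here, but the idea can be sketched: align each vision model's activation matrix to a shared consensus with the Generalized Procrustes Algorithm, then score each stimulus by how far the models' aligned representations scatter around that consensus. A minimal numpy sketch; the function names, the centering/normalization choices, and the residual-based dispersion score are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def generalized_procrustes(mats, n_iter=20):
    """Iteratively rotate each (n_stimuli, d) matrix onto a shared consensus.

    Assumes the matrices are already commensurate (same shape). The paper's
    exact normalization is not specified, so each matrix is mean-centered
    and Frobenius-normalized before alignment.
    """
    aligned = [m - m.mean(axis=0) for m in mats]
    aligned = [m / np.linalg.norm(m) for m in aligned]
    consensus = aligned[0]
    for _ in range(n_iter):
        rotated = []
        for m in aligned:
            # orthogonal Procrustes step: best rotation of m onto the consensus
            u, _, vt = np.linalg.svd(m.T @ consensus)
            rotated.append(m @ (u @ vt))
        aligned = rotated
        consensus = np.mean(aligned, axis=0)
    return aligned, consensus

def per_stimulus_dispersion(aligned, consensus):
    """Mean squared distance of each model's aligned representation from the
    consensus, one scalar per stimulus (low = high agreement among models)."""
    stack = np.stack(aligned)  # (n_models, n_stimuli, d)
    return ((stack - consensus) ** 2).sum(axis=-1).mean(axis=0)
```

As a sanity check on the construction: several rotated copies of the same representation score near-zero dispersion on every stimulus, while model-specific noise raises the score.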
If this is right
- Low-dispersion stimuli can be preferentially selected to maximize observed convergence between vision and language representations.
- The single-stimulus dispersion measure offers a concrete way to isolate which inputs drive or prevent cross-modal alignment.
- The reported effect generalizes across multiple vision-language model pairs and different stimulus selection procedures.
- Measuring convergence at the individual-stimulus level supplies a route to identifying the sources of agreement and disagreement between modalities.
Where Pith is reading between the lines
- Stimuli that admit multiple visual interpretations may systematically increase modality gaps in multimodal systems.
- Curating training data around low-dispersion examples could improve alignment in large multimodal models without changing architecture.
- The same dispersion measure might be applied to test whether consistent stimuli better predict alignment with human brain activity across sensory modalities.
Load-bearing premise
The Generalized Procrustes Algorithm applied to model activations gives a valid, unbiased measure of meaningful intra-modal convergence for individual stimuli.
What would settle it
Repeating the analysis with a different alignment metric or a broader set of vision models that changes the dispersion ranking and eliminates the factor-of-two difference in cross-modal alignment would falsify the central claim.
Original abstract
Neural networks exhibit a remarkable degree of representational convergence across diverse architectures, training objectives, and even data modalities. This convergence is predictive of alignment with brain representation. A recent hypothesis suggests this arises from learning the underlying structure in the environment in similar ways. However, it is unclear how individual stimuli elicit convergent representations across networks. An image can be perceived in multiple ways and expressed differently using words. Here, we introduce a methodology based on the Generalized Procrustes Algorithm to measure intra-modal representational convergence at the single-stimulus level. We applied this to vision models with distinct training objectives, selecting stimuli based on their degree of alignment (intra-modal dispersion). Crucially, we found that this intra-modal dispersion strongly modulates alignment between vision and language models (cross-modal convergence). Specifically, stimuli with low intra-modal dispersion (high agreement among vision models) elicited significantly higher cross-modal alignment than those with high dispersion, by up to a factor of two (e.g., in pairings of DINOv2 with language models). This effect was robust to stimulus selection criteria and generalized across different pairings of vision and language models. Measuring convergence at the single-stimulus level provides a path toward understanding the sources of convergence and divergence across modalities, and between neural networks and human neural representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a methodology based on the Generalized Procrustes Algorithm (GPA) to quantify intra-modal representational dispersion at the single-stimulus level across vision models with distinct training objectives. It reports that stimuli exhibiting low intra-modal dispersion (high agreement among vision models) produce significantly higher cross-modal alignment with language models—by up to a factor of two—than high-dispersion stimuli, with the effect robust to stimulus selection criteria and generalizing across model pairings (e.g., DINOv2 with language models).
Significance. If the central empirical observation holds after methodological validation, the work provides a stimulus-level tool for dissecting sources of representational convergence and divergence across modalities. It could inform stimulus selection strategies for brain-model alignment studies and test hypotheses about shared environmental structure learning, while offering a path to link network convergence to human neural representations.
major comments (2)
- [Methods (GPA application and intra-modal dispersion calculation)] The central claim that low intra-modal dispersion modulates cross-modal alignment by up to a factor of two rests on GPA yielding an unbiased per-stimulus convergence score. However, the paper applies GPA to activations from vision models whose spaces differ in dimensionality and geometry; standard GPA minimizes summed squared distances after orthogonal transformations but requires commensurate input matrices. Without explicit preprocessing steps, dimensionality normalization, or controls for architectural artifacts (e.g., in the Methods section describing the GPA procedure), the dispersion metric risks capturing model-specific properties rather than stimulus-driven agreement.
- [Results (cross-modal alignment modulation)] The reported robustness and factor-of-two effect are presented without accompanying statistical details. The Results section should include the specific tests used to establish significance, error bars or confidence intervals on the alignment scores, stimulus counts per dispersion bin, and baseline comparisons (e.g., against random or shuffled pairings) to rule out post-hoc selection effects or confounds.
minor comments (2)
- [Abstract] The abstract states the effect is 'significantly higher' but does not report the statistical test, degrees of freedom, or exact p-value threshold.
- [Methods] Notation for the dispersion metric and alignment score should be defined explicitly on first use, including any normalization applied before GPA.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have addressed each major comment point by point below, making revisions to improve methodological clarity and statistical rigor where appropriate. These changes strengthen the presentation of our findings without altering the core results.
Point-by-point responses
Referee: The central claim that low intra-modal dispersion modulates cross-modal alignment by up to a factor of two rests on GPA yielding an unbiased per-stimulus convergence score. However, the paper applies GPA to activations from vision models whose spaces differ in dimensionality and geometry; standard GPA minimizes summed squared distances after orthogonal transformations but requires commensurate input matrices. Without explicit preprocessing steps, dimensionality normalization, or controls for architectural artifacts (e.g., in the Methods section describing the GPA procedure), the dispersion metric risks capturing model-specific properties rather than stimulus-driven agreement.
Authors: We agree that additional explicit details on the GPA procedure are needed to address potential concerns about dimensionality and geometry differences. In the revised Methods section, we now describe the preprocessing pipeline in full: activations from each vision model are first reduced to a common dimensionality (the minimum across models) via PCA, followed by mean-centering and scaling to unit variance per model before applying the standard GPA implementation. We have also added a control analysis in which dispersion scores are recomputed after randomly permuting activations across stimuli within each model; the resulting null distribution shows substantially higher dispersion than the observed stimulus-specific scores, confirming that the metric primarily reflects agreement driven by stimulus properties rather than architectural artifacts. These revisions directly mitigate the risk identified.
Revision: yes
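The preprocessing pipeline the rebuttal describes (PCA to the minimum shared dimensionality, mean-centering, unit-variance scaling, plus a permutation null) can be sketched in numpy. SVD-based PCA and per-model variance scaling are assumptions where the rebuttal leaves details open; the function names are illustrative.

```python
import numpy as np

def make_commensurate(acts, d_min=None):
    """Project each (n_stimuli, d_k) activation matrix to a shared
    dimensionality via PCA, then center and scale to unit variance.
    SVD-based PCA and per-model scaling are assumed choices."""
    if d_min is None:
        d_min = min(a.shape[1] for a in acts)
    out = []
    for a in acts:
        a = a - a.mean(axis=0)                 # mean-center per model
        _, _, vt = np.linalg.svd(a, full_matrices=False)
        a = a @ vt[:d_min].T                   # keep top d_min principal components
        out.append(a / a.std())                # unit variance per model
    return out

def permutation_null(acts, dispersion_fn, n_perm=100, seed=0):
    """Null distribution for dispersion: shuffle stimulus order independently
    within each model, destroying stimulus-specific agreement while keeping
    each model's activation statistics intact."""
    rng = np.random.default_rng(seed)
    nulls = []
    for _ in range(n_perm):
        shuffled = [a[rng.permutation(len(a))] for a in acts]
        nulls.append(dispersion_fn(shuffled))
    return np.concatenate(nulls)
```

If observed per-stimulus dispersion sits well below this null, agreement is stimulus-driven rather than an artifact of model-specific geometry, which is the control the rebuttal proposes.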
Referee: The reported robustness and factor-of-two effect are presented without accompanying statistical details. The Results section should include the specific tests used to establish significance, error bars or confidence intervals on the alignment scores, stimulus counts per dispersion bin, and baseline comparisons (e.g., against random or shuffled pairings) to rule out post-hoc selection effects or confounds.
Authors: We acknowledge that the original Results section would benefit from more comprehensive statistical reporting. In the revised manuscript, we have expanded the Results to include: (i) the exact statistical tests (two-sample t-tests with Bonferroni correction for low- vs. high-dispersion bin comparisons, plus a linear regression of alignment on dispersion score), (ii) 95% confidence intervals as error bars on all alignment scores, (iii) explicit stimulus counts per bin (approximately 400 stimuli in the lowest-dispersion quartile and 400 in the highest, with sensitivity checks across alternative binning thresholds), and (iv) baseline comparisons against shuffled model pairings and randomly selected stimulus sets, which yield alignment values significantly lower than the observed low-dispersion condition (p < 0.001). These additions confirm the robustness of the factor-of-two effect and rule out post-hoc selection confounds.
Revision: yes
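The statistical reporting the authors commit to can be mirrored in a small analysis helper: quartile binning by dispersion, a two-sample t-test with Bonferroni correction, 95% confidence intervals, and the low/high alignment ratio. The binning scheme and output fields are illustrative assumptions, not the manuscript's code.

```python
import numpy as np
from scipy import stats

def compare_dispersion_bins(dispersion, alignment, alpha=0.05, n_tests=1):
    """Compare cross-modal alignment between the lowest- and highest-
    dispersion quartiles of stimuli; Bonferroni-correct over n_tests."""
    lo_q, hi_q = np.quantile(dispersion, [0.25, 0.75])
    lo = alignment[dispersion <= lo_q]   # high-agreement stimuli
    hi = alignment[dispersion >= hi_q]   # low-agreement stimuli
    t, p = stats.ttest_ind(lo, hi)

    def ci95(x):
        # 95% confidence interval on the bin mean
        h = stats.sem(x) * stats.t.ppf(0.975, len(x) - 1)
        return (x.mean() - h, x.mean() + h)

    return {
        "t": t,
        "p": p,
        "significant": p < alpha / n_tests,   # Bonferroni correction
        "ratio": lo.mean() / hi.mean(),       # the "factor of two" quantity
        "n_per_bin": (len(lo), len(hi)),
        "ci_low_bin": ci95(lo),
        "ci_high_bin": ci95(hi),
    }
```

On synthetic data where alignment decreases with dispersion, `ratio` recovers the low-vs-high multiplier directly, which is the quantity the factor-of-two claim rests on.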
Circularity Check
No circularity: empirical observation from independent metrics
Full rationale
The paper computes intra-modal dispersion via GPA on vision-model activations for individual stimuli, then measures cross-modal alignment between those vision activations and separate language-model activations on the same stimuli. The reported modulation (low dispersion yielding up to 2× higher alignment) is a direct empirical correlation between two distinct quantities; neither is defined in terms of the other, fitted to the target effect, nor justified by self-citation chains. No equations reduce the central claim to a tautology or renamed input, and the methodology is applied to pre-existing model outputs without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Generalized Procrustes Algorithm can be applied to high-dimensional model activations to produce a scalar measure of representational agreement across models for a single stimulus.