pith. machine review for the scientific record. sign in

arxiv: 2511.21331 · v2 · submitted 2025-11-26 · 💻 cs.CV · cs.AI

The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Pith reviewed 2026-05-17 05:00 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords contrastive fusionhigher-order multimodal alignmentmultimodal representation learningcross-modal retrievalfused modality contrastive termXOR-like dependenciesjoint embedding space
0
0 comments X

The pith

Contrastive Fusion aligns individual modalities with their fused combinations to capture higher-order dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Contrastive Fusion (ConFu) to learn joint representations across multiple modalities by moving past standard pairwise alignment. It places both single modalities and fused versions of modality pairs into one embedding space and applies contrastive losses to align the fused pairs with a third modality. This addition lets the model recover complex interactions such as XOR-like relationships that pairwise methods alone cannot recover. The framework is tested on synthetic data with known higher-order structure and on real multimodal benchmarks for retrieval and classification. It reports competitive results while supporting both one-to-one and two-to-one retrieval inside the same contrastive setup.

Core claim

ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence.

What carries the argument

Contrastive Fusion (ConFu), a framework that embeds both individual modalities and their fused combinations into a unified representation space and aligns them through an added fused-modality contrastive term.

If this is right

  • Enables unified one-to-one and two-to-one retrieval within a single contrastive framework.
  • Maintains competitive performance on retrieval and classification while capturing cross-modal complementarity.
  • Captures higher-order dependencies that pairwise alignment alone cannot recover.
  • Scales with increasing multimodal complexity on both synthetic and real-world benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fused-term idea could be applied to settings with four or more modalities to test whether even richer interactions become accessible.
  • A single trained model might replace separate pairwise and multi-way retrieval pipelines in deployed multimodal systems.
  • Similar extensions of contrastive objectives could be tried in self-supervised vision or language models where fusion of views is already common.
  • The approach suggests a general template for adding higher-order terms without breaking existing pairwise fidelity.

Load-bearing premise

The additional fused-modality contrastive term successfully encodes higher-order interactions without requiring specific fusion operators or post-hoc adjustments that might not generalize beyond the evaluated benchmarks.

What would settle it

On synthetic data constructed with explicit higher-order XOR dependencies, ConFu shows no retrieval or classification gain over a standard pairwise contrastive baseline.

Figures

Figures reproduced from arXiv: 2511.21331 by Grigorios Tsagkatakis, Konstantinos Kontras, Maarten De Vos, Michaela Areti Zervou, Panagiotis Tsakalides, Stefanos Koutoupis.

Figure 1
Figure 1. Figure 1: Confu unifies direct (1→1) and compositional (2→1) alignment within a single embedding space, utilizing two modali￾ties for improved performance and adapting seamlessly when only one modality is available. as images and text, enables powerful zero-shot transfer and retrieval. However, these approaches remain intrin￾sically pairwise, capturing only correlations between two modalities while overlooking the h… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of ConFu. The framework aligns all modality [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Results on the synthetic XOR task. Our model’s ac [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Few-shot linear probing results on the SSW60 (top) and [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Aggregated modality overlap proportions across all [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy as a function of embedding dimension for [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Effect of the masking ratio on multimodal performance for the MLP fusion model. Each subplot reports the performance change [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-class modality overlap analysis on the SSW60 dataset in the zero-shot classification setting. Each cell shows the percentage of [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Few-shot linear probing results for SSW60 [ [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Few-shot linear probing results for VB100 [ [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Few-shot linear probing results for SSW60 [ [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Few-shot linear probing results for VB100 [ [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Few-shot linear probing results for CUB200 [ [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Overview of the multimodal data generation pipeline. [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗
read the original abstract

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework. We release our code and dataset at https://github.com/estafons/confu.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Contrastive Fusion (ConFu), a multimodal alignment framework that extends standard pairwise contrastive objectives with an additional fused-modality contrastive term. This term aligns modality pairs with a third modality in a unified embedding space, with the stated goal of capturing higher-order dependencies (e.g., XOR-like relations) that pairwise methods cannot recover, while preserving strong pairwise performance. The approach is evaluated on synthetic and real-world multimodal benchmarks for retrieval and classification tasks, including unified one-to-one and two-to-one retrieval, with code and data released.

Significance. If the fused term demonstrably encodes irreducible higher-order interactions beyond what can be recovered from summed pairwise InfoNCE losses, the method could meaningfully advance multimodal representation learning by balancing higher-order complementarity with pairwise fidelity. The public release of code and dataset supports reproducibility and further testing of the higher-order claim.

major comments (3)
  1. [§3] §3 (Method), fused-modality contrastive term: The central claim that this term captures XOR-like higher-order dependencies unrecoverable by pairwise alignment alone is load-bearing. The abstract and high-level description state that the term 'encourages the joint embedding of modality pairs with a third modality,' but without an explicit equation for the fusion operator (concatenation, element-wise product, attention, etc.) and the resulting loss, it is unclear whether the composite objective introduces non-pairwise statistics or simply reweights existing pairwise signals. A concrete derivation or proof sketch showing that the fused term induces irreducible triple-wise mutual information would be required.
  2. [§4.1] §4.1 (Synthetic experiments), XOR-like dependency tests: The paper asserts that ConFu captures higher-order relations 'that cannot be recovered through pairwise alignment alone.' However, if the fusion operator is linear or element-wise, any observed gain on synthetic XOR tasks could be an artifact of the particular fusion chosen rather than a property of the contrastive objective. An ablation replacing the fused term with an equivalent sum of pairwise terms (or reporting the effective rank of the interaction) is needed to substantiate the claim.
  3. [§4.2] §4.2–4.3 (Real-world benchmarks), scaling with multimodal complexity: The evaluation claims ConFu scales with increasing numbers of modalities and supports two-to-one retrieval. It is unclear whether the reported gains on retrieval and classification hold after controlling for the total number of contrastive pairs or the effective batch size introduced by the fused term. A controlled comparison that matches the number of negative samples across baselines would strengthen the scaling claim.
minor comments (3)
  1. [§3] Notation for the fused embedding (e.g., how f(m_i, m_j) is defined) should be introduced earlier and used consistently in the loss equations.
  2. [Figure 2] Figure 2 (or equivalent diagram of the ConFu architecture) would benefit from explicit arrows or labels showing which pairs participate in the fused contrastive term versus the standard pairwise terms.
  3. [Abstract] The abstract mentions 'synthetic and real-world multimodal benchmarks' but does not list the exact datasets or tasks in the first paragraph; adding this would improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating the revisions we will implement to address the concerns raised.

read point-by-point responses
  1. Referee: [§3] §3 (Method), fused-modality contrastive term: The central claim that this term captures XOR-like higher-order dependencies unrecoverable by pairwise alignment alone is load-bearing. The abstract and high-level description state that the term 'encourages the joint embedding of modality pairs with a third modality,' but without an explicit equation for the fusion operator (concatenation, element-wise product, attention, etc.) and the resulting loss, it is unclear whether the composite objective introduces non-pairwise statistics or simply reweights existing pairwise signals. A concrete derivation or proof sketch showing that the fused term induces irreducible triple-wise mutual information would be required.

    Authors: We concur that the method section would benefit from more explicit mathematical details. In the revised manuscript, we will add the precise definition of the fusion operator, which involves concatenating the embeddings of the modality pair and applying a linear projection to obtain the fused representation, along with the complete formulation of the fused-modality contrastive loss. We will also include a short illustrative derivation demonstrating that the fused term can encode joint dependencies not captured by pairwise terms in the context of the XOR example. However, a general proof of inducing irreducible triple-wise mutual information is a deeper theoretical issue that we will note as a direction for future work rather than claim to fully resolve here. revision: partial

  2. Referee: [§4.1] §4.1 (Synthetic experiments), XOR-like dependency tests: The paper asserts that ConFu captures higher-order relations 'that cannot be recovered through pairwise alignment alone.' However, if the fusion operator is linear or element-wise, any observed gain on synthetic XOR tasks could be an artifact of the particular fusion chosen rather than a property of the contrastive objective. An ablation replacing the fused term with an equivalent sum of pairwise terms (or reporting the effective rank of the interaction) is needed to substantiate the claim.

    Authors: This is a valuable suggestion to bolster the empirical support for our claims. We will incorporate an ablation experiment in the updated version of the paper. Specifically, we will compare ConFu against a variant where the fused term is substituted by an additional sum of pairwise contrastive losses, and report the results on the synthetic XOR dependency tests to show that the performance improvement is attributable to the fused objective rather than just increased pairwise signals. revision: yes

  3. Referee: [§4.2] §4.2–4.3 (Real-world benchmarks), scaling with multimodal complexity: The evaluation claims ConFu scales with increasing numbers of modalities and supports two-to-one retrieval. It is unclear whether the reported gains on retrieval and classification hold after controlling for the total number of contrastive pairs or the effective batch size introduced by the fused term. A controlled comparison that matches the number of negative samples across baselines would strengthen the scaling claim.

    Authors: We agree that ensuring fair comparison in terms of computational resources is important. In the revised manuscript, we will add a set of controlled experiments where the total number of negative samples is matched between ConFu and the baseline methods by appropriately scaling the batch sizes or the number of negatives sampled. This will allow us to confirm that the observed gains in retrieval and classification tasks are not solely due to the additional contrastive pairs introduced by the fused term. revision: yes

standing simulated objections not resolved
  • A complete theoretical derivation proving that the fused-modality contrastive term induces irreducible triple-wise mutual information that cannot be recovered from pairwise terms.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines ConFu as an explicit extension of standard pairwise contrastive losses by adding a fused-modality term, then evaluates the resulting model on synthetic and real benchmarks. No step in the provided abstract or described framework reduces a claimed higher-order capability to a quantity fitted from the same evaluation data, nor does any load-bearing premise rest on a self-citation chain that itself lacks independent verification. The derivation remains self-contained: the loss is stated, the fusion is introduced as part of the method, and performance is measured externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the standard assumption that contrastive objectives can align embeddings across modalities and that a simple fusion operation can produce representations whose alignment captures higher-order structure; no new physical entities or ad-hoc constants are introduced.

axioms (1)
  • domain assumption Contrastive loss functions can produce semantically meaningful joint embeddings when applied to paired multimodal data.
    Invoked implicitly when extending pairwise contrastive objectives to fused modalities.

pith-pipeline@v0.9.0 · 5555 in / 1236 out tokens · 42691 ms · 2026-05-17T05:00:43.606207+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 4 internal anchors

  1. [1]

    Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper)

    Santiago Castro, Devamanyu Hazarika, Ver ´onica P ´erez- Rosas, Roger Zimmermann, Rada Mihalcea, and Sou- janya Poria. Towards multimodal sarcasm detec- tion (an obviously perfect paper).arXiv preprint arXiv:1906.01815, 2019. 5

  2. [2]

    The inaturalist sounds dataset.Advances in Neural Information Processing Systems, 37:132524– 132544, 2024

    Mustafa Chasmai, Alexander Shepard, Subhransu Maji, and Grant Van Horn. The inaturalist sounds dataset.Advances in Neural Information Processing Systems, 37:132524– 132544, 2024. 4

  3. [3]

    Valor: Vision-audio-language omni-perception pretraining model and dataset,

    Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Wein- ing Wang, Jinhui Tang, and Jing Liu. Valor: Vision-audio- language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345, 2023. 2

  4. [4]

    Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset.Advances in Neural Information Processing Sys- tems, 36:72842–72866, 2023

    Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset.Advances in Neural Information Processing Sys- tems, 36:72842–72866, 2023. 2

  5. [5]

    Gramian multimodal representation learning and alignment.arXiv preprint arXiv:2412.11959,

    Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, and Danilo Comminiello. Gramian multimodal representation learning and alignment.arXiv preprint arXiv:2412.11959,

  6. [6]

    A triangle enables multimodal alignment beyond cosine similarity.arXiv preprint arXiv:2509.24734, 2025

    Giordano Cicchetti, Eleonora Grassucci, and Danilo Com- miniello. A triangle enables multimodal alignment beyond cosine similarity.arXiv preprint arXiv:2509.24734, 2025. 2, 5, 6, 7, 8, 3

  7. [7]

    Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023. 5

  8. [8]

    What to align in multimodal con- trastive learning?arXiv preprint arXiv:2409.07402, 2024

    Benoit Dufumier, Javiera Castillo-Navarro, Devis Tuia, and Jean-Philippe Thiran. What to align in multimodal con- trastive learning?arXiv preprint arXiv:2409.07402, 2024. 8

  9. [9]

    Exploit- ing temporal information for dcnn-based fine-grained object classification

    ZongYuan Ge, Chris McCool, Conrad Sanderson, Peng Wang, Lingqiao Liu, Ian Reid, and Peter Corke. Exploit- ing temporal information for dcnn-based fine-grained object classification. In2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pages 1–6. IEEE, 2016. 4, 5, 6

  10. [10]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15180–15190, 2023. 1, 2

  11. [11]

    Audioclip: Extending clip to image, text and au- dio

    Andrey Guzhov, Federico Raue, J ¨orn Hees, and Andreas Dengel. Audioclip: Extending clip to image, text and au- dio. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022. 1, 2

  12. [12]

    Ur-funny: A multimodal language dataset for understanding humor.arXiv preprint arXiv:1904.06618, 2019

    Md Kamrul Hasan, Wasifur Rahman, Amir Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe Morency, et al. Ur-funny: A multimodal language dataset for understanding humor.arXiv preprint arXiv:1904.06618, 2019. 5

  13. [13]

    Reconboost: boosting can achieve modal- ity reconcilement

    Cong Hua, Qianqian Xu, Shilong Bao, Zhiyong Yang, and Qingming Huang. Reconboost: boosting can achieve modal- ity reconcilement. InProceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024. 8

  14. [14]

    Modality competition: What makes joint training of multi-modal network fail in deep learn- ing?(provably)

    Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang. Modality competition: What makes joint training of multi-modal network fail in deep learn- ing?(provably). InInternational Conference on Machine Learning, pages 9226–9259. PMLR, 2022. 1, 7

  15. [15]

    Free spoken digit dataset.https : / / github

    Zoran Jackson. Free spoken digit dataset.https : / / github . com / Jakobovski / free - spoken - digit-dataset, 2017. Accessed: YYYY-MM-DD. 6

  16. [16]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,

  17. [17]

    Balancing multimodal training through game-theoretic regularization

    Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul Pu Liang, Matthew B Blaschko, and Maarten De V os. Balancing multimodal training through game-theoretic regularization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. 8

  18. [18]

    Blaschko, and Maarten De V os

    Konstantinos Kontras, Christos Chatzichristos, Matthew B. Blaschko, and Maarten De V os. Improving multimodal learning with multi-loss gradient modulation. In35th British Machine Vision Conference 2024, BMVC 2024, Glasgow, UK, November 25-28, 2024. BMV A, 2024. 8

  19. [19]

    Multibench: Multiscale benchmarks for multimodal representation learning

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Yufan Chen, Peter Wu, Michelle A Lee, Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. InThirty-fifth Con- ference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. 5, 6

  20. [20]

    Factor- ized contrastive learning: Going beyond multi-view redun- dancy.Advances in Neural Information Processing Systems, 36:32971–32998, 2023

    Paul Pu Liang, Zihao Deng, Martin Q Ma, James Y Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. Factor- ized contrastive learning: Going beyond multi-view redun- dancy.Advances in Neural Information Processing Systems, 36:32971–32998, 2023. 8, 1

  21. [21]

    Temporal fusion transformers for interpretable multi-horizon time series forecasting.International journal of forecasting, 37(4):1748–1764, 2021

    Bryan Lim, Sercan ¨O Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting.International journal of forecasting, 37(4):1748–1764, 2021. 2

  22. [22]

    Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection

    Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3042–3051, 2022. 2 9

  23. [23]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 4

  24. [24]

    Balanced multimodal learning via on-the-fly gradient modulation

    Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. Balanced multimodal learning via on-the-fly gradient modulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8238– 8247, 2022. 8

  25. [25]

    Karol J. Piczak. ESC: Dataset for Environmental Sound Classification. InProceedings of the 23rd Annual ACM Con- ference on Multimedia, pages 1015–1018. ACM Press. 5, 6

  26. [26]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1

  27. [27]

    Contrasting with symile: Simple model- agnostic representation learning for unlimited modalities

    Adriel Saporta, Aahlad Manas Puli, Mark Goldstein, and Ra- jesh Ranganath. Contrasting with symile: Simple model- agnostic representation learning for unlimited modalities. Advances in Neural Information Processing Systems, 37: 56919–56957, 2024. 2, 5, 6, 7, 8, 3

  28. [28]

    The multiinforma- tion function as a tool for measuring stochastic dependence

    Milan Studen `y and Jirina Vejnarov ´a. The multiinforma- tion function as a tool for measuring stochastic dependence. InLearning in graphical models, pages 261–297. Springer,

  29. [29]

    Gemma Team. Gemma. 2024. 5

  30. [30]

    Exploring fine- grained audiovisual categorization with the ssw60 dataset

    Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, and Serge Belongie. Exploring fine- grained audiovisual categorization with the ssw60 dataset. In European Conference on Computer Vision, pages 271–289. Springer, 2022. 4, 5, 6

  31. [31]

    Centralnet: a multilayer approach for mul- timodal fusion

    Valentin Vielzeuf, Alexis Lechervy, St ´ephane Pateux, and Fr´ed´eric Jurie. Centralnet: a multilayer approach for mul- timodal fusion. InProceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 5

  32. [32]

    The caltech-ucsd birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Per- ona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. 4, 7

  33. [33]

    Decou- pling common and unique representations for multimodal self-supervised learning

    Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham, Chenying Liu, Zhitong Xiong, and Xiao Xiang Zhu. Decou- pling common and unique representations for multimodal self-supervised learning. InEuropean Conference on Com- puter Vision, pages 286–303. Springer, 2024. 8

  34. [34]

    Connecting multi-modal contrastive repre- sentations.Advances in Neural Information Processing Sys- tems, 36:22099–22114, 2023

    Zehan Wang, Yang Zhao, Haifeng Huang, Jiageng Liu, Aox- iong Yin, Li Tang, Linjun Li, Yongqi Wang, Ziang Zhang, and Zhou Zhao. Connecting multi-modal contrastive repre- sentations.Advances in Neural Information Processing Sys- tems, 36:22099–22114, 2023. 1, 8

  35. [35]

    Freebind: Free lunch in unified multimodal space via knowledge fusion.arXiv preprint arXiv:2405.04883, 2024

    Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, et al. Freebind: Free lunch in unified multimodal space via knowledge fusion.arXiv preprint arXiv:2405.04883, 2024. 2

  36. [36]

    Omnibind: Large-scale omni multimodal representation via binding spaces.arXiv preprint arXiv:2407.11895, 2024

    Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, and Zhou Zhao. Omnibind: Large-scale omni multimodal representa- tion via binding spaces.arXiv preprint arXiv:2407.11895,

  37. [37]

    Information theoretical analysis of multi- variate correlation.IBM Journal of research and develop- ment, 4(1):66–82, 1960

    Satosi Watanabe. Information theoretical analysis of multi- variate correlation.IBM Journal of research and develop- ment, 4(1):66–82, 1960. 3

  38. [38]

    Diagnos- ing and re-learning for balanced multimodal learning

    Yake Wei, Siwei Li, Ruoxuan Feng, and Di Hu. Diagnos- ing and re-learning for balanced multimodal learning. In European Conference on Computer Vision, pages 71–86. Springer, 2024. 8

  39. [39]

    Xeno-canto: Sharing wildlife sounds from around the world.https: //www.xeno-canto.org/, 2008

    Xeno-canto Foundation for Nature Sounds. Xeno-canto: Sharing wildlife sounds from around the world.https: //www.xeno-canto.org/, 2008. 5

  40. [40]

    mplug-2: A modularized multi-modal foundation model across text, image and video

    Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. InInternational Con- ference on Machine Learning, pages 38728–38748. PMLR,

  41. [41]

    Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

    Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1179–1189, 2023. 1, 2

  42. [42]

    Ulip-2: Towards scalable multimodal pre-training for 3d understanding

    Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Jun- nan Li, Roberto Mart´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024. 1, 2

  43. [43]

    Biotrove: A large curated image dataset enabling ai for bio- diversity.Advances in Neural Information Processing Sys- tems, 37:102101–102120, 2024

    Chih-Hsuan Yang, Benjamin Feuer, Talukder Jubery, Zi Deng, Andre Nakkab, Md Zahid Hasan, Shivani Chiranjeevi, Kelly Marshall, Nirmal Baishnab, Asheesh Singh, et al. Biotrove: A large curated image dataset enabling ai for bio- diversity.Advances in Neural Information Processing Sys- tems, 37:102101–102120, 2024. 4

  44. [44]

    A survey of mul- timodal learning: Methods, applications, and future.ACM Computing Surveys, 57(7):1–34, 2025

    Yuan Yuan, Zhaojian Li, and Bin Zhao. A survey of mul- timodal learning: Methods, applications, and future.ACM Computing Surveys, 57(7):1–34, 2025. 1

  45. [45]

    MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos

    Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos.arXiv preprint arXiv:1606.06259, 2016. 5

  46. [46]

    Extending multi-modal contrastive rep- resentations.Advances in Neural Information Processing Systems, 37:91880–91903, 2024

    Ziang Zhang, Zehan Wang, Luping Liu, Rongjie Huang, Xize Cheng, Zhenhui Ye, Huadai Liu, Haifeng Huang, Yang Zhao, Tao Jin, et al. Extending multi-modal contrastive rep- resentations.Advances in Neural Information Processing Systems, 37:91880–91903, 2024. 1, 2, 8

  47. [47]

    Uni3d: Exploring unified 3d representation at scale.arXiv preprint arXiv:2310.06773,

    Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale.arXiv preprint arXiv:2310.06773,

  48. [48]

    LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretrain- ing to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852, 2023. 1, 2 10 THE MORE, THE MERRIER: CONTRASTIVE FUSION FOR HIGHER-ORDER MULTIMODAL ALIGNMENT Su...