pith. machine review for the scientific record.

arxiv: 2604.11043 · v4 · submitted 2026-04-13 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:10 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal embeddings, zero-shot transfer, cross-modal retrieval, orthogonal alignment, proxy embeddings, unpaired modalities, unified embedding spaces, gradient interference

The pith

EmergentBridge connects unpaired modalities in unified embeddings by aligning them only in directions orthogonal to existing anchors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unified multimodal models often receive supervision for only a few modality pairs, such as image and text, leaving pairs like audio and depth weakly linked and poor at zero-shot tasks. The paper shows that directly pulling a new modality toward a proxy of an existing embedding creates gradient interference that weakens the original alignments. EmergentBridge instead learns to produce a noisy bridge anchor and restricts the new alignment to the subspace perpendicular to the anchor direction. This keeps the original structure intact while building stronger connections for the missing pairs. Experiments across nine datasets confirm gains in zero-shot classification and retrieval for the previously unpaired modalities.
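The alignment machinery the summary refers to is standard contrastive learning; as a generic sketch (not the paper's exact formulation), one direction of an InfoNCE loss over a batch of paired embeddings looks like:

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """One direction of an InfoNCE loss over paired embeddings.

    Row i of `queries` and row i of `keys` form a positive pair; every other
    row of `keys` serves as a negative. A generic sketch, not the paper's
    exact objective.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                 # cosine similarities, scaled
    # log-softmax over each row; the matching pair sits on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.arange(len(q))
    return -log_probs[labels, labels].mean()
```

With perfectly aligned pairs (e.g. identical orthonormal embeddings on both sides) the loss approaches zero; misaligned pairs push it up.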

Core claim

EmergentBridge improves zero-shot cross-modal transfer by learning a mapping that produces a noisy bridge anchor from an already-aligned embedding and then enforcing proxy alignment exclusively in the subspace orthogonal to the anchor-alignment direction, which preserves the structure used by existing retrieval and classification while increasing connectivity for unpaired modality pairs.

What carries the argument

The orthogonal subspace restriction applied to proxy alignment, which isolates new-modality updates from the directions that support existing anchor alignments.
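As a minimal illustration of such a restriction (the helper name and notation are ours, not the paper's), projecting a vector onto the orthogonal complement of a unit anchor direction can be sketched as:

```python
import numpy as np

def project_orthogonal(x, c_bar):
    """Project x onto the orthogonal complement of the anchor direction
    c_bar, then renormalize. A sketch of the kind of subspace restriction
    described above; not the paper's exact operator."""
    c = c_bar / np.linalg.norm(c_bar)       # unit anchor-alignment direction
    x_perp = x - np.dot(x, c) * c           # strip the component along c
    return x_perp / np.linalg.norm(x_perp)  # back onto the unit sphere

x = np.array([0.6, 0.8, 0.0])
c = np.array([1.0, 0.0, 0.0])
p = project_orthogonal(x, c)
print(np.dot(p, c))  # → 0.0 (no component left along the anchor direction)
```

Updates built from such projected vectors cannot move embeddings along the anchor direction, which is the mechanism that protects the existing alignments.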

If this is right

  • Zero-shot classification and retrieval improve on unpaired modality combinations without collecting exhaustive pairwise labels.
  • Existing alignments between anchor modalities remain stable when new modalities are added.
  • Unified embedding spaces become more scalable because new modalities can be incorporated using only partial supervision.
  • Emergent alignment appears between previously disconnected pairs such as audio-depth or infrared-audio.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The orthogonal-update idea could extend to continual addition of modalities without full retraining of the embedding space.
  • Similar subspace isolation might help in other embedding domains where partial alignments must be preserved while adding new entities.
  • The method suggests a general pattern for growing multimodal systems incrementally while protecting performance on the original tasks.

Load-bearing premise

Naively aligning a new modality to a synthesized proxy embedding introduces gradient interference that degrades anchor alignments, and restricting the alignment to the orthogonal subspace avoids this interference while still strengthening the desired connections.
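A toy numeric sketch of this premise (illustrative linear updates, not the paper's InfoNCE-based losses): a naive pull toward the proxy moves the new-modality embedding along the anchor direction, while the orthogonally restricted pull leaves that coordinate untouched.

```python
import numpy as np

def naive_update(x_b, proxy, lr=0.1):
    """Pull x_b straight toward the proxy; this can disturb the
    anchor-direction coordinate (the interference described above)."""
    return x_b + lr * (proxy - x_b)

def orthogonal_update(x_b, proxy, c_bar, lr=0.1):
    """Apply the same pull, but only within the subspace orthogonal
    to the anchor direction c_bar."""
    c = c_bar / np.linalg.norm(c_bar)
    delta = proxy - x_b
    delta = delta - np.dot(delta, c) * c   # remove the anchor-direction part
    return x_b + lr * delta

c_bar = np.array([1.0, 0.0])   # anchor-alignment direction
x_b   = np.array([0.9, 0.1])   # new-modality embedding
proxy = np.array([0.3, 0.8])   # synthesized proxy target

print(naive_update(x_b, proxy)[0])                # → 0.84 (anchor coordinate drifts)
print(orthogonal_update(x_b, proxy, c_bar)[0])    # → 0.9 (anchor coordinate preserved)
```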

What would settle it

Running the same experiments with the orthogonal restriction removed and finding no degradation in anchor zero-shot performance would show that the subspace constraint is not required to prevent interference.

Figures

Figures reproduced from arXiv: 2604.11043 by Heyan Huang, Jincheng Xie, Runheng Liu, Xingchen Xiao, Yu Zheng, Zhongyi Huang.

Figure 1. Zero-Shot Language-Related Task Performance.
Figure 2. EmergentBridge vs. ImageBind and LanguageBind: The left image illustrates ImageBind's approach, where modalities …
Figure 3. An Example Overview of EmergentBridge. M_a, M_b, and C correspond to text, audio, and image, respectively. (Step 1) We align the anchor modality (image) with an already-aligned modality (text) using InfoNCE. (Step 2) We train a proxy predictor that maps image embeddings to proxy text embeddings, while keeping the image/text encoders frozen. (Step 3) We align audio with both the image embedding and the synthesi…
Figure 4. Orthogonal-subspace regularization. (a) T_{c̄_i} projects x onto the orthogonal complement of the anchor-alignment direction c̄_i and normalizes it. (b) The InfoNCE objective pulls x^b_i toward c_i (black arrow), while the orthogonal-subspace regularizer guides x^b_i toward x̂^a_i within directions orthogonal to c̄_i, reducing interference with anchor alignment.
Figure 5. Experimental Results on Classification and Retrieval Tasks After Varying Hyperparameters.
Figure 6. CDF Curves of Similarity on VGG-S and SUN: The top and bottom rows respectively illustrate the cumulative …
Original abstract

Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image–text), leaving unpaired modality pairs (e.g., audio↔depth, infrared↔audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose EmergentBridge, an embedding-level bridging framework that improves performance on these unpaired pairs without requiring exhaustive pairwise supervision. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce gradient interference, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a noisy bridge anchor (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes EmergentBridge, an embedding-level bridging framework for improving zero-shot cross-modal transfer in unified multimodal models under sparse modality-pair supervision. It identifies gradient interference from naive proxy alignment as a key issue and addresses it via a noisy bridge anchor combined with proxy alignment restricted to the orthogonal complement of the anchor-alignment direction. The central empirical claim is consistent outperformance over prior binding baselines on zero-shot classification and retrieval across nine datasets spanning multiple modalities.

Significance. If the empirical results hold under detailed scrutiny, the work could meaningfully advance scalable unified multimodal embeddings by enabling stronger connectivity for unpaired modalities without exhaustive pairwise data. The orthogonal-subspace constraint offers a lightweight structural solution to preserving anchor alignments while enhancing emergent connectivity, which may prove useful in other alignment settings.

major comments (1)
  1. [§4 (Experiments)] The abstract states that EmergentBridge 'consistently outperforms prior binding baselines on zero-shot classification and retrieval' across nine datasets, yet the manuscript provides no full experimental details, baseline specifications, exact metrics, ablations, or statistical significance tests. This absence is load-bearing because the outperformance claim is the primary support for the method's effectiveness and for the assertion that the orthogonal restriction avoids side effects on existing alignments.
minor comments (2)
  1. [Abstract] The phrase 'noisy bridge anchor' is introduced without an accompanying equation or precise definition of the noise model, making it difficult to reproduce the construction from the text alone.
  2. [Abstract] The claim that the method works 'without requiring exhaustive pairwise supervision' would be strengthened by an explicit statement of the minimal supervision regime used in the experiments (e.g., which modality pairs remain unpaired).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the experimental claims require substantially more detail to be fully convincing and will revise the paper accordingly.

Point-by-point responses
  1. Referee: [§4 (Experiments)] The abstract states that EmergentBridge 'consistently outperforms prior binding baselines on zero-shot classification and retrieval' across nine datasets, yet the manuscript provides no full experimental details, baseline specifications, exact metrics, ablations, or statistical significance tests. This absence is load-bearing because the outperformance claim is the primary support for the method's effectiveness and for the assertion that the orthogonal restriction avoids side effects on existing alignments.

    Authors: We acknowledge that the submitted manuscript does not present the experimental protocol with sufficient completeness. In the revised version we will expand §4 with: (i) full dataset descriptions and preprocessing for all nine benchmarks, (ii) exact baseline implementations, architectures, and hyper-parameters, (iii) complete numerical tables reporting all metrics (accuracy, mAP, Recall@K, etc.) together with standard deviations over multiple random seeds, (iv) targeted ablations that isolate the noisy-bridge-anchor component and the orthogonal-subspace constraint, and (v) statistical significance tests (paired t-tests and Wilcoxon signed-rank tests) comparing EmergentBridge against each baseline. These additions will directly substantiate the outperformance claim and demonstrate that the orthogonal restriction preserves anchor-alignment performance. revision: yes
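The significance tests the rebuttal proposes can be run with standard tools; a hedged sketch using illustrative (not actual) per-dataset scores for the nine benchmarks:

```python
# Sketch of the paired significance tests the rebuttal commits to.
# The score lists below are illustrative placeholders, not results from the paper.
from scipy import stats

bridge   = [71.2, 64.5, 58.9, 80.1, 69.7, 55.3, 62.0, 74.8, 66.4]  # method, per dataset
baseline = [68.9, 63.1, 57.2, 79.5, 67.0, 54.8, 60.1, 73.9, 65.0]  # baseline, per dataset

t_stat, t_p = stats.ttest_rel(bridge, baseline)   # paired t-test over datasets
w_stat, w_p = stats.wilcoxon(bridge, baseline)    # Wilcoxon signed-rank test

print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```

Both tests pair scores by dataset, which matches the per-benchmark comparison structure the referee asks for; reporting both guards against the t-test's normality assumption.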

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's central proposal introduces a structural constraint (noisy bridge anchor plus orthogonal-subspace restriction) to mitigate an observed training issue in multimodal embeddings. This is framed as an empirical engineering solution rather than a derivation that reduces to its own fitted parameters or self-referential definitions. No load-bearing step equates a claimed prediction or uniqueness result to an input by construction, and the performance claims rest on external dataset evaluations rather than internal tautologies. Self-citations, if present, are not invoked to justify the core mechanism as an external theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard domain assumptions in embedding learning about directional structure in representation spaces and the existence of gradient interference during proxy alignment. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Embedding spaces contain identifiable anchor-alignment directions that can be isolated from other connectivity directions.
    Invoked to justify the orthogonal subspace restriction.

pith-pipeline@v0.9.0 · 5552 in / 1140 out tokens · 73359 ms · 2026-05-13T07:10:35.312187+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 3 internal anchors
