arxiv: 2511.12034 · v2 · submitted 2025-11-15 · cs.CV · cs.LG · cs.MM

Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-kiong Ng, Tat-Seng Chua

Pith reviewed 2026-05-17 22:05 UTC · model grok-4.3

classification cs.CV · cs.LG · cs.MM

keywords multimodal representation learning · missing modalities · anchor shift · calibrated alignment · representation imputation · bi-step learning

The pith

Missing modalities cause an anchor shift in multimodal alignments that can be corrected by representation-level imputation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard multimodal alignment fails when some modalities are absent because the observed ones lock onto a suboptimal local anchor instead of the ideal joint space. CalMRL fixes this by imputing the missing representations directly in latent space using priors and cross-modal connections. A bi-step optimizer with a closed-form posterior for the shared latents makes the correction tractable. Once added to any existing strong multimodal method, the approach lets the model train on the many real-world datasets that contain incomplete samples. Experiments confirm the shift is reduced and downstream performance improves.

Core claim

Incomplete alignments arise because observed modalities converge to a local anchor that differs from the global optimum available only when every modality is present; this deviation produces an irreducible shift that CalMRL removes by explicitly modeling imputation of the missing representations from priors and inherent modality connections, solved via bi-step learning that admits a closed-form posterior over the shared latents.

What carries the argument

CalMRL's calibrated alignment, which performs representation-level imputation of missing modalities by combining priors with cross-modal connections inside a bi-step learner that has a closed-form solution for the posterior of shared latents.
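
To make "a closed-form posterior over the shared latents" concrete, here is a minimal sketch under a linear-Gaussian toy model. Everything in it is an assumption made for illustration: the generative model, the per-modality loading matrices `W`, the noise variance, and the function names are hypothetical. This page does not reproduce the paper's actual model, so this should not be read as CalMRL's implementation.

```python
import numpy as np


def posterior_shared_latent(z, W, observed, noise_var=0.1):
    """Gaussian posterior over the shared latent s given the observed modalities.

    Assumed toy model (not taken from the paper):
        s ~ N(0, I_k),   z_m = W[m] @ s + eps_m,   eps_m ~ N(0, noise_var * I_d)
    z:        array (M, d); rows of missing modalities are ignored
    W:        array (M, d, k); hypothetical per-modality loading matrices
    observed: boolean array (M,); the per-instance missingness mask
    """
    k = W.shape[2]
    precision = np.eye(k)          # prior precision of s
    info = np.zeros(k)
    for m in np.flatnonzero(observed):
        precision += W[m].T @ W[m] / noise_var
        info += W[m].T @ z[m] / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ info
    return mean, cov


def impute_missing(z, W, observed, noise_var=0.1):
    """Fill rows of missing modalities with their posterior-mean prediction."""
    mean, _ = posterior_shared_latent(z, W, observed, noise_var)
    z_filled = z.copy()
    for m in np.flatnonzero(~observed):
        z_filled[m] = W[m] @ mean
    return z_filled


# Synthetic data drawn from the assumed model (illustrative only).
rng = np.random.default_rng(0)
M, d, k = 3, 8, 4
W = rng.normal(size=(M, d, k))
s_true = rng.normal(size=k)
z = np.einsum("mdk,k->md", W, s_true) + 0.01 * rng.normal(size=(M, d))

observed = np.array([True, True, False])     # third modality is missing
z_in = np.where(observed[:, None], z, 0.0)   # hide the missing row
z_filled = impute_missing(z_in, W, observed, noise_var=1e-4)
imputation_error = np.linalg.norm(z_filled[2] - z[2])
print(imputation_error)
```

In this toy setting the mask decides which modalities contribute to the posterior, so the same routine covers any missingness pattern with at least one observed modality. Whether the paper's posterior has this form is exactly what the referee report below asks the authors to spell out.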

If this is right

Any current multimodal method gains the ability to train on datasets that contain samples with missing modalities.
The anchor shift is provably mitigated and the optimization converges under the supplied theoretical guidance.
Representation-level imputation replaces the need for modality-specific generative models.
The same calibration step can be inserted into contrastive, reconstruction, or fusion pipelines without redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same local-to-global anchor correction may apply to any alignment task that suffers from partial observations, such as sensor fusion with dropped channels.
If the priors used for imputation are themselves learned from complete data, the method could be iterated to bootstrap better priors on progressively more incomplete collections.
Testing whether the closed-form posterior remains stable when the number of missing modalities exceeds one would reveal the practical range of the bi-step solver.

Load-bearing premise

The deviation between the local anchor formed by observed modalities and the true global anchor can be corrected by imputing missing representations from priors and modality connections.

What would settle it

Running an existing multimodal method with and without CalMRL on the same dataset that contains missing modalities and finding no consistent gain in alignment quality or downstream accuracy.

Figures

Figures reproduced from arXiv: 2511.12034 by Jiaheng Wei, See-kiong Ng, Shuo Yang, Tat-Seng Chua, Xiaobo Xia, Xiaohao Liu, Xiu Su.

**Figure 1.** Missing modalities result in distorted representation alignment. Different modalities (in green) are aligned together with a virtual anchor (in red) implicitly with all modalities present. With missing modalities, observed ones are enforced to be aligned with a local anchor, deviating from the correct, i.e., anchor shift.

**Figure 2.** The overall framework of CalMRL. Observed unimodal content is first encoded to corresponding representations {z_m}_{m∈Ω} with individual encoders ϕ_m in θ. Despite the missing modalities (i.e., M/Ω), CalMRL calibrates multimodal alignment whereby missing modalities are imputed by generative parameters θb. Finally, L_rep optimizes the observed unimodal encoder to be aligned with the calibrated direction.

**Figure 5.** The performance comparison across missing, calibrated, and full (“ideal”) modalities. All the models are trained on MSR-VTT.

**Figure 6.** t-SNE visualization on multimodal representations generated by different models. Existing models, under missing modality training, present clearly separated clusters for each modality (distinct modal boundaries). Fortunately, CalMRL mitigates this issue.

**Figure 7.** Loss curves across models on the training phase.

Original abstract

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical calibration for missing modalities in multimodal alignment using an anchor-shift view and bi-step closed-form posterior, but the derivation may lean on complete-data assumptions that limit how far it transfers.

Hi, the main point on this one is that it tackles a real bottleneck (how to train multimodal models on datasets where some modalities are absent) by reframing the alignment problem as an anchor shift and fixing it with a bi-step procedure that imputes at the representation level using priors and cross-modal links. The closed-form posterior over shared latents is the concrete technical step that lets them avoid the usual optimization mess. That feels like a genuine addition rather than a routine tweak, and it directly targets the practical issue of discarding incomplete examples in vision-language or sensor work. Releasing the code is also a plus for anyone who wants to test the claims themselves.

On the downside, the soundness is the soft spot. The stress-test concern lands: if the closed-form posterior is derived by marginalizing under the assumption of full modalities or a fixed missingness pattern, then the calibration step may not actually correct the shift in the general missing-modality case the paper advertises. The abstract mentions theoretical guidance on convergence and shift mitigation, but without seeing the full derivation it is hard to tell whether the priors introduce circularity or whether the method holds when modalities are truly absent. Experiments are reported as superior, yet the lack of visible error bars or tight baseline controls makes the gains harder to judge.

This is aimed at researchers who work on multimodal representation learning and need to use incomplete real-world data. A reader focused on alignment fixes or practical deployment would get value from the anchor-shift framing and the bi-step trick. It has enough of a distinct technical proposal and addresses a common pain point that it deserves a serious referee to check the math and the experiments in detail. I would send it to peer review with a request to verify the assumptions behind the closed-form solution.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that missing modalities induce an anchor shift in multimodal alignment, where observed modalities align to a suboptimal local anchor rather than the optimal full-modality anchor. It proposes CalMRL to correct this via representation-level imputation of missing modalities using priors and cross-modal connections, implemented through bi-step learning that yields a closed-form posterior over shared latents. Theoretical analysis is provided for anchor-shift mitigation and convergence, and experiments demonstrate that equipping existing advanced methods with CalMRL enables handling of incomplete data with superior performance.

Significance. If the closed-form posterior derivation and anchor-shift correction are shown to hold for arbitrary missing-modality patterns, the work would provide a principled mechanism to extend multimodal representation methods to the incomplete datasets common in practice. The public code release aids reproducibility and follow-up work.

major comments (2)

[§3.2] Bi-step learning and closed-form posterior: the marginalization yielding the closed-form posterior over shared latents is presented without explicit incorporation of a missingness mask or per-instance modality indicator; if the joint distribution is defined only over complete observations, the resulting calibration step does not demonstrably correct the local-anchor deviation under the general missing-modality regime asserted in the abstract and §1.
[§4.3] Table 2: the reported gains for CalMRL-augmented baselines lack standard deviations across runs or statistical significance tests, so it is unclear whether the improvements over the uncalibrated versions are reliable or could be explained by variance in the missing-modality simulation.

minor comments (2)

[§2.1] The distinction between the local anchor a_l and the optimal anchor a* is introduced informally; an explicit equation defining both quantities and the shift metric would clarify the subsequent theoretical claims.
[Figure 3] Caption: the visualization of imputed representations does not indicate the missing-modality rate or which modality is absent, reducing interpretability of the qualitative results.
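
For orientation, the kind of explicit definition the first minor comment asks for could look like the following. This is a hypothetical formalization, not the paper's: it borrows the leading-left-singular-vector notion from Theorem 1 as quoted further down this page, and the symbols $Z$ (representations of all modalities stacked as columns), $Z_{\Omega}$ (the observed columns only), and $\Delta(\Omega)$ are assumptions introduced here.

```latex
% Hypothetical definitions; u_1(.) denotes the leading left singular vector.
a^{*} = u_1(Z), \qquad
a_{l} = u_1(Z_{\Omega}), \qquad
\Delta(\Omega) = \min_{s \in \{+1,-1\}} \lVert a^{*} - s\, a_{l} \rVert_2
```

The minimum over the sign $s$ is there because singular vectors are only defined up to sign, so without it the shift metric would not be well defined.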

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and outline the revisions we plan to make.

Referee: [§3.2] Bi-step learning and closed-form posterior: the marginalization yielding the closed-form posterior over shared latents is presented without explicit incorporation of a missingness mask or per-instance modality indicator; if the joint distribution is defined only over complete observations, the resulting calibration step does not demonstrably correct the local-anchor deviation under the general missing-modality regime asserted in the abstract and §1.

Authors: We appreciate this careful reading of §3.2. The closed-form posterior is derived by marginalizing over the shared latents in the bi-step optimization, where the first step imputes missing modality representations using modality priors and cross-modal connections. This imputation effectively conditions on the observed modalities for each instance, thereby addressing the anchor shift in the incomplete regime. The joint distribution in the theoretical analysis assumes complete observations to establish the existence of the optimal anchor, but the calibration procedure generalizes to missing patterns by replacing missing representations with their imputed counterparts before alignment. To clarify this, we will revise §3.2 to explicitly include a per-instance modality indicator and missingness mask in the posterior derivation and the optimization steps.
Revision: yes

Referee: [§4.3] Table 2: the reported gains for CalMRL-augmented baselines lack standard deviations across runs or statistical significance tests, so it is unclear whether the improvements over the uncalibrated versions are reliable or could be explained by variance in the missing-modality simulation.

Authors: We acknowledge that the absence of standard deviations and significance testing in Table 2 limits the interpretability of the results. In the revised manuscript, we will rerun the experiments with multiple random seeds, report mean performance with standard deviations, and include statistical significance tests (e.g., paired t-tests) comparing CalMRL-augmented methods against their baselines to confirm that the observed gains are statistically reliable.
Revision: yes
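
The reporting promised in this response (mean with standard deviation, plus a paired t-test across seeds) can be sketched with the Python standard library alone. The scores below are placeholders invented for the example and are not results from the paper; the function names are likewise hypothetical.

```python
import math
import statistics


def mean_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(baseline, calibrated):
    """Paired t statistic and degrees of freedom for per-seed score pairs.

    Each position must hold the two scores obtained with the same seed.
    Raises ZeroDivisionError if every per-seed difference is identical.
    """
    if len(baseline) != len(calibrated) or len(baseline) < 2:
        raise ValueError("need at least two matched runs")
    diffs = [c - b for b, c in zip(baseline, calibrated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for five seeds; these are NOT numbers from the paper.
baseline_runs = [40.0, 41.0, 39.0, 40.5, 39.5]
calibrated_runs = [41.0, 42.5, 40.0, 41.0, 41.0]

t_stat, dof = paired_t_statistic(baseline_runs, calibrated_runs)
print(mean_std(baseline_runs), mean_std(calibrated_runs), t_stat, dof)
```

The statistic is then compared against a Student t critical value for the given degrees of freedom; with 4 degrees of freedom the two-sided 5% threshold is about 2.776. Pairing by seed matters here because the same simulated missingness pattern is shared by both runs in a pair.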

Circularity Check

0 steps flagged

No circularity: derivation self-contained via model assumptions and closed-form posterior

The paper's chain starts from an anchor-shift observation, introduces priors and cross-modal connections as modeling assumptions, then derives a bi-step optimization whose closed-form posterior is obtained by marginalization under the stated generative model. No step reduces a claimed prediction or first-principles result to a fitted parameter or self-citation by construction. The calibration is presented as an application of the derived posterior rather than a renaming or re-use of the target quantity itself. External validation via experiments on missing-modality data supplies independent grounding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that modality priors and inherent connections exist and can be leveraged for accurate representation-level imputation, plus the premise that a bi-step procedure with closed-form posterior resolves the optimization dilemma without introducing new biases.

axioms (2)

domain assumption: Modalities possess inherent connections and priors that allow accurate imputation of missing representations. Invoked to model imputation for missing modalities at the representation level.
ad hoc to paper: The optimization dilemma caused by missing modalities can be resolved via bi-step learning with a closed-form posterior. Used to justify the training procedure and convergence claims.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Theorem 1 (Anchor shift under incomplete modality alignment). Let $u_1$ and $u_1^{\Omega}$ be the leading left singular vectors... $\sqrt{2\left(1 - (\sigma_1^{\Omega} + \eta^2)/\sigma_1\right)} \le \|u_1 - u_1^{\Omega}\| \le \sqrt{2}\,\|Z_{\bar{\Omega}}\|_2 / (\sigma_1 - \sigma_2)$
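
The quantities in this passage are easy to evaluate numerically. The sketch below is an illustration on synthetic data, not a check of the theorem: the passage is quoted with an elision, so its full conditions are not visible here. It assumes that representations are stacked as columns of a matrix `Z`, that the "anchor" is the leading left singular vector, and that σ₁ and σ₂ are the top two singular values of the full matrix. Function names and data are hypothetical.

```python
import numpy as np


def leading_left_singular_vector(Z):
    """Return the top left singular vector of Z and all singular values."""
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0], S


def anchor_shift(Z, observed):
    """Distance between the full anchor and the anchor of the observed columns.

    Z:        array (d, M), one column per modality representation
    observed: boolean array (M,), True where the modality is present
    Returns the shift and the right-hand-side quantity of the quoted bound.
    """
    u_full, S = leading_left_singular_vector(Z)
    u_obs, _ = leading_left_singular_vector(Z[:, observed])
    if u_full @ u_obs < 0:  # singular vectors are defined only up to sign
        u_obs = -u_obs
    shift = np.linalg.norm(u_full - u_obs)
    missing = Z[:, ~observed]
    missing_norm = np.linalg.norm(missing, 2) if missing.shape[1] > 0 else 0.0
    upper = np.sqrt(2.0) * missing_norm / (S[0] - S[1])
    return shift, upper


# Synthetic representations: a shared direction plus modality-specific noise.
rng = np.random.default_rng(0)
d, M = 16, 4
shared = rng.normal(size=d)
Z = shared[:, None] + 0.3 * rng.normal(size=(d, M))
Z /= np.linalg.norm(Z, axis=0, keepdims=True)  # unit-norm columns

full_mask = np.ones(M, dtype=bool)
partial_mask = np.array([True, True, False, False])
shift_full, _ = anchor_shift(Z, full_mask)
shift_partial, upper_partial = anchor_shift(Z, partial_mask)
print(shift_full, shift_partial, upper_partial)
```

With every modality observed the shift is zero by construction; dropping columns moves the leading direction, and the printed values show how that movement compares with the spectral norm of the missing block divided by the singular-value gap.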

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 5 internal anchors

[1]

Multimodal deep learning

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y Ng, et al. Multimodal deep learning. InICML, volume 11, pages 689–696, 2011

work page 2011
[2]

A theory of multimodal learning.NeurIPS, 36:57244–57255, 2023

Zhou Lu. A theory of multimodal learning.NeurIPS, 36:57244–57255, 2023

work page 2023
[3]

Multimodal learning with transformers: A survey.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023

Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023

work page 2023
[4]

Gen- eralized domain prompt learning for accessible scientific vision-language models.Nexus, 2(2), 2025

Qinglong Cao, Yuntian Chen, Lu Lu, Hao Sun, Zhengzhong Zeng, Xiaokang Yang, and Dongxiao Zhang. Gen- eralized domain prompt learning for accessible scientific vision-language models.Nexus, 2(2), 2025

work page 2025
[5]

Multibench: Multiscale benchmarks for multimodal representation learning

Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A Lee, Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. InNeurIPS, 2021

work page 2021
[6]

Quantifying & modeling multimodal interactions: An information decomposition framework

Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, et al. Quantifying & modeling multimodal interactions: An information decomposition framework. InNeurIPS, pages 27351–27393, 2023

work page 2023
[7]

Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding

Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

work page arXiv 2025
[8]

Vit-lens: Towards omni-modal representations

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, and Mike Zheng Shou. Vit-lens: Towards omni-modal representations. InCVPR, pages 26647–26657, 2024

work page 2024
[9]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Position: The platonic representation hypoth- esis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypoth- esis. InICML, 2024

work page 2024
[11]

Understanding the emergence of multi- modal representation alignment

Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, and Paul Pu Liang. Understanding the emergence of multi- modal representation alignment. InICML, 2025

work page 2025
[12]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021. 19 APREPRINT

work page 2021
[13]

Imagebind: One embedding space to bind them all

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InCVPR, pages 15180–15190, 2023

work page 2023
[14]

Vast: A vision- audio-subtitle-text omni-modality foundation model and dataset

Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision- audio-subtitle-text omni-modality foundation model and dataset. InNeurIPS, pages 72842–72866, 2023

work page 2023
[15]

Gramian multimodal represen- tation learning and alignment

Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, and Danilo Comminiello. Gramian multimodal represen- tation learning and alignment. InICLR, 2025

work page 2025
[16]

Continual multimodal contrastive learning

Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, and Tat-Seng Chua. Continual multimodal contrastive learning. In NeurIPS, 2025

work page 2025
[17]

What to align in multimodal contrastive learning?ICLR, 2025

Benoit Dufumier, Javiera Castillo-Navarro, Devis Tuia, and Jean-Philippe Thiran. What to align in multimodal contrastive learning?ICLR, 2025

work page 2025
[18]

Principled multimodal representation learning

Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, and Tat-Seng Chua. Principled multimodal representation learning. arXiv preprint arXiv:2507.17343, 2025

work page arXiv 2025
[19]

Wav2clip: Learning robust audio representations from clip

Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. Wav2clip: Learning robust audio representations from clip. InICASSP, pages 4563–4567, 2022

work page 2022
[20]

Videoclip: Contrastive pre-training for zero-shot video-text understanding.arXiv preprint arXiv:2109.14084, 2021

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding.arXiv preprint arXiv:2109.14084, 2021

work page arXiv 2021
[21]

Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304, 2022

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304, 2022

work page 2022
[22]

Audioclip: Extending clip to image, text and audio

Andrey Guzhov, Federico Raue, J ¨orn Hees, and Andreas Dengel. Audioclip: Extending clip to image, text and audio. InICASSP, pages 976–980, 2022

work page 2022
[23]

Pointclip: Point cloud understanding by clip

Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. InCVPR, pages 8552–8562, 2022

work page 2022
[24]

A TRIANGLE enables multimodal alignment beyond cosine similarity

Giordano Cicchetti, Eleonora Grassucci, and Danilo Comminiello. A TRIANGLE enables multimodal alignment beyond cosine similarity. InNeurIPS, 2025

work page 2025
[25]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009

work page 2009
[26]

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. InICLR. OpenReview.net, 2024

work page 2024
[27]

Openvid-1m: A large-scale high-quality dataset for text-to-video generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. InICLR. OpenReview.net, 2025

work page 2025
[28]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InACL, pages 119–132, 2019

work page 2019
[29]

Clotho: An audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. InICASSP, pages 736–740. IEEE, 2020

work page 2020
[30]

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InICASSP, pages 1–5. IEEE, 2023

work page 2023
[31]

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based se- mantic alignment.arXiv preprint arXiv:2310.01852, 2023. 20 APREPRINT

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following.arXiv preprint arXiv:2309.00615, 2023

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following.arXiv preprint arXiv:2309.00615, 2023

work page arXiv 2023
[33]

Unibind: Llm-augmented unified and balanced repre- sentation space to bind them all

Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, and Lin Wang. Unibind: Llm-augmented unified and balanced repre- sentation space to bind them all. InCVPR, pages 26752–26762, 2024

work page 2024
[34]

Omnibind: Large-scale omni multimodal representation via binding spaces.arXiv preprint arXiv:2407.11895, 2024

Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, and Zhou Zhao. Omnibind: Large-scale omni multimodal representation via binding spaces.arXiv preprint arXiv:2407.11895, 2024

work page arXiv 2024
[35]

Self-supervised multimodal learning: A survey

Yongshuo Zong, Oisin Mac Aodha, and Timothy Hospedales. Self-supervised multimodal learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[36]

Multi-modal contrastive masked autoencoders: A two-stage progressive pre-training approach for rgbd datasets

Muhammad Abdullah Jamal and Omid Mohareri. Multi-modal contrastive masked autoencoders: A two-stage progressive pre-training approach for rgbd datasets. InCVPR, pages 17947–17957, 2025

work page 2025
[37]

Dpu: Dynamic prototype updating for multimodal out-of-distribution detection

Shawn Li, Huixian Gong, Hao Dong, Tiankai Yang, Zhengzhong Tu, and Yue Zhao. Dpu: Dynamic prototype updating for multimodal out-of-distribution detection. InCVPR, pages 10193–10202, June 2025

work page 2025
[38]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InICML, pages 1597–1607, 2020

work page 2020
[39]

Crossclr: Cross-modal contrastive learning for multi-modal video representations

Mohammadreza Zolfaghari, Yi Zhu, Peter Gehler, and Thomas Brox. Crossclr: Cross-modal contrastive learning for multi-modal video representations. InICCV, pages 1450–1459, 2021

work page 2021
[40]

Cross-modal contrastive learning for text-to-image generation

Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. InCVPR, pages 833–842, 2021

work page 2021
[41]

Few-shot adversarial prompt learning on vision-language models

Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models. InNeurIPS, pages 3122–3156, 2024

work page 2024
[42]

Clap: learning audio concepts from natural language supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap: learning audio concepts from natural language supervision. InICASSP, pages 1–5, 2023

work page 2023
[43]

Freebind: Free lunch in unified multimodal space via knowledge fusion.arXiv preprint arXiv:2405.04883, 2024

Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, et al. Freebind: Free lunch in unified multimodal space via knowledge fusion.arXiv preprint arXiv:2405.04883, 2024

work page arXiv 2024
[44]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InCVPR, pages 1728–1738, 2021

work page 2021
[45]

Learning audio-video modalities from image captions

Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. InECCV, pages 407–426, 2022

work page 2022
[46]

Videoprism: A foundational visual encoder for video under- standing

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J Sun, Luke Fried- man, Rui Qian, Tobias Weyand, Yue Zhao, et al. Videoprism: A foundational visual encoder for video under- standing. InICML, pages 60785–60811. PMLR, 2024

work page 2024
[47]

Miradata: A large-scale video dataset with long durations and structured captions

Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, NeurIPS, 2024

work page 2024
[48]

A touch, vision, and language dataset for multimodal alignment

Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment. InICML. OpenReview.net, 2024

work page 2024
[49]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, pages 12888–12900, 2022. 21 APREPRINT

work page 2022
[50]

Valor: Vision- audio-language omni-perception pretraining model and dataset.arXiv preprint arXiv:2304.08345, 2023

Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, and Jing Liu. Valor: Vision- audio-language omni-perception pretraining model and dataset.arXiv preprint arXiv:2304.08345, 2023

work page arXiv 2023
[51]

Omnivec: Learning robust representations with cross modal sharing

Siddharth Srivastava and Gaurav Sharma. Omnivec: Learning robust representations with cross modal sharing. In WACV, pages 1225–1237. IEEE, 2024

work page 2024
[52]

VIT-LENS: towards omni-modal representations

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, and Mike Zheng Shou. VIT-LENS: towards omni-modal representations. In CVPR, pages 26637–26647. IEEE, 2024

work page 2024
[53]

Onellm: One framework to align all modalities with language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities with language. In CVPR, pages 26574–26585. IEEE, 2024

work page 2024
[54]

Deep Multimodal Learning with Missing Modality: A Survey

Renjie Wu, Hu Wang, Hsiang-Ting Chen, and Gustavo Carneiro. Deep multimodal learning with missing modality: A survey. arXiv preprint arXiv:2409.07825, 2024

work page internal anchor Pith review arXiv 2024
[55]

Are multimodal transformers robust to missing modality? In CVPR, pages 18177–18186, 2022

Mengmeng Ma, Jian Ren, Long Zhao, Davide Testuggine, and Xi Peng. Are multimodal transformers robust to missing modality? In CVPR, pages 18177–18186, 2022

work page 2022
[56]

Deep adversarial learning for multi-modality missing data completion

Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. Deep adversarial learning for multi-modality missing data completion. In KDD, pages 1158–1166, 2018

work page 2018
[57]

Incomplete multimodality-diffused emotion recognition

Yuanzhi Wang, Yong Li, and Zhen Cui. Incomplete multimodality-diffused emotion recognition. In NeurIPS, pages 17117–17128, 2023

work page 2023
[58]

Missing modality imagination network for emotion recognition with uncertain missing modalities

Jinming Zhao, Ruichen Li, and Qin Jin. Missing modality imagination network for emotion recognition with uncertain missing modalities. In ACL, pages 2608–2618, 2021

work page 2021
[59]

Smil: Multimodal learning with severely missing modality

Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. Smil: Multimodal learning with severely missing modality. In AAAI, volume 35, pages 2302–2310, 2021

work page 2021
[60]

M3care: Learning with missing modalities in multimodal healthcare data

Chaohe Zhang, Xu Chu, Liantao Ma, Yinghao Zhu, Yasha Wang, Jiangtao Wang, and Junfeng Zhao. M3care: Learning with missing modalities in multimodal healthcare data. In KDD, pages 2418–2428, 2022

work page 2022
[61]

Multi-modal learning with missing modality via shared-specific feature modelling

Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. Multi-modal learning with missing modality via shared-specific feature modelling. In CVPR, pages 15878–15887, 2023

work page 2023
[62]

Rethinking missing modality learning from a decoding perspective

Tao Jin, Xize Cheng, Linjun Li, Wang Lin, Ye Wang, and Zhou Zhao. Rethinking missing modality learning from a decoding perspective. In MM, pages 4431–4439, 2023

work page 2023
[63]

Modal-nexus auto-encoder for multi-modality cellular data integration and imputation. Nature Communications, 15(1):9021, 2024

Zhenchao Tang, Guanxing Chen, Shouzhi Chen, Jianhua Yao, Linlin You, and Calvin Yu-Chian Chen. Modal-nexus auto-encoder for multi-modality cellular data integration and imputation. Nature Communications, 15(1):9021, 2024

work page 2024
[64]

Probabilistic conformal distillation for enhancing missing modality robustness

Mengxi Chen, Fei Zhang, Zihua Zhao, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Probabilistic conformal distillation for enhancing missing modality robustness. In CVPR, pages 36218–36242, 2024

work page 2024
[65]

Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts

Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts. In NeurIPS, pages 98782–98805, 2024

work page 2024
[66]

Knowledge bridger: Towards training-free missing modality completion

Guanzhou Ke, Shengfeng He, Xiaoli Wang, Bo Wang, Guoqing Chao, Yuanyang Zhang, Yi Xie, and Hexing Su. Knowledge bridger: Towards training-free missing modality completion. In CVPR, pages 25864–25873, 2025

work page 2025
[67]

Boosting discriminability for robust multimodal entity linking with visual modality missing

Mingrui Lao, Zheng Li, Yanming Guo, Xueyi Zhang, Siqi Cai, Zhaoyun Ding, and Haizhou Li. Boosting discriminability for robust multimodal entity linking with visual modality missing. In SIGIR, pages 989–999, 2025

work page 2025
[68]

Learnable cross-modal knowledge distillation for multi-modal learning with missing modality

Hu Wang, Congbo Ma, Jianpeng Zhang, Yuan Zhang, Jodie Avery, Louise Hull, and Gustavo Carneiro. Learnable cross-modal knowledge distillation for multi-modal learning with missing modality. In MICCAI, pages 216–226, 2023

work page 2023
[69]

Leveraging knowledge of modality experts for incomplete multimodal learning

Wenxin Xu, Hexin Jiang, and Xuefeng Liang. Leveraging knowledge of modality experts for incomplete multimodal learning. In MM, pages 438–446, 2024

work page 2024
[70]

Simmlm: A simple framework for multi-modal learning with missing modality. arXiv preprint arXiv:2507.19264, 2025

Sijie Li, Chen Chen, and Jungong Han. Simmlm: A simple framework for multi-modal learning with missing modality. arXiv preprint arXiv:2507.19264, 2025

work page arXiv 2025
[71]

Robust multimodal learning with missing modalities via parameter-efficient adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Md Kaykobad Reza, Ashley Prater-Bennette, and M Salman Asif. Robust multimodal learning with missing modalities via parameter-efficient adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[72]

Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611–622, 1999

Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611–622, 1999

work page 1999
[73]

Factor analysis, probabilistic principal component analysis, variational inference, and variational autoencoder: Tutorial and survey. arXiv preprint arXiv:2101.00734, 2021

Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Factor analysis, probabilistic principal component analysis, variational inference, and variational autoencoder: Tutorial and survey. arXiv preprint arXiv:2101.00734, 2021

work page arXiv 2021
[74]

Better together: Leveraging unpaired multimodal data for stronger unimodal models. arXiv preprint arXiv:2510.08492, 2025

Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, and Phillip Isola. Better together: Leveraging unpaired multimodal data for stronger unimodal models. arXiv preprint arXiv:2510.08492, 2025

work page arXiv 2025
[75]

Collecting highly parallel data for paraphrase evaluation

David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, pages 190–200, 2011

work page 2011
[76]

Localizing moments in video with natural language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, pages 5803–5812, 2017

work page 2017
[77]

Dense-captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In ICCV, pages 706–715, 2017

work page 2017
[78]

Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, pages 4581–4591, 2019

work page 2019
[79]

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[80]

Matrix perturbation theory. Handbook of linear algebra, pages 15–21, 2006

Ren-Cang Li. Matrix perturbation theory. Handbook of linear algebra, pages 15–21, 2006

work page 2006

Showing first 80 references.