pith. machine review for the scientific record.

arxiv: 2511.12034 · v2 · submitted 2025-11-15 · 💻 cs.CV · cs.LG · cs.MM

Recognition: 2 theorem links

· Lean Theorem

Calibrated Multimodal Representation Learning with Missing Modalities


Pith reviewed 2026-05-17 22:05 UTC · model grok-4.3

classification 💻 cs.CV cs.LG cs.MM
keywords multimodal representation learning, missing modalities, anchor shift, calibrated alignment, representation imputation, bi-step learning

The pith

Missing modalities cause an anchor shift in multimodal alignments that can be corrected by representation-level imputation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard multimodal alignment fails when some modalities are absent because the observed ones lock onto a suboptimal local anchor instead of the ideal joint space. CalMRL fixes this by imputing the missing representations directly in latent space using priors and cross-modal connections. A bi-step optimizer with a closed-form posterior for the shared latents makes the correction tractable. Once added to any existing strong multimodal method, the approach lets the model train on the many real-world datasets that contain incomplete samples. Experiments confirm the shift is reduced and downstream performance improves.

Core claim

Incomplete alignments arise because observed modalities converge to a local anchor that differs from the global optimum available only when every modality is present; this deviation produces an irreducible shift that CalMRL removes by explicitly modeling imputation of the missing representations from priors and inherent modality connections, solved via bi-step learning that admits a closed-form posterior over the shared latents.
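
A toy illustration of the shift described above. It is a sketch under a strong assumption that this summary does not confirm: the anchor is taken to be the plain centroid of one instance's modality embeddings. The vectors, the modality names, and the helper functions `centroid` and `anchor_shift` are all made up for illustration.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def anchor_shift(embeddings, observed, imputed=None):
    """Distance between the anchor built from observed (plus any imputed)
    embeddings and the anchor built from all true embeddings.
    ASSUMPTION: anchor == centroid; the paper may define it differently."""
    imputed = imputed or {}
    global_anchor = centroid(list(embeddings.values()))
    used = [embeddings[m] for m in observed] + list(imputed.values())
    return math.dist(centroid(used), global_anchor)

# Hypothetical 2-D embeddings of one instance in three modalities.
embeddings = {
    "vision": [1.0, 0.0],
    "text": [0.0, 1.0],
    "audio": [-1.0, -1.0],
}

full = anchor_shift(embeddings, ["vision", "text", "audio"])
partial = anchor_shift(embeddings, ["vision", "text"])  # audio missing
calibrated = anchor_shift(
    embeddings, ["vision", "text"], imputed={"audio": [-0.8, -0.9]}
)  # audio replaced by a hypothetical imperfect estimate

print(full, partial, calibrated)
```

In this toy setup the shift is zero with all modalities, largest when audio is simply dropped, and small when an approximate audio representation is substituted, which is the qualitative behaviour the core claim relies on.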

What carries the argument

CalMRL's calibrated alignment, which performs representation-level imputation of missing modalities by combining priors with cross-modal connections inside a bi-step learner that has a closed-form solution for the posterior of shared latents.
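
One standard setting in which the posterior over a shared latent is available in closed form is a linear-Gaussian model. The sketch below assumes such a model with a scalar shared latent `s ~ N(0, 1)` and per-modality representations `z^m = w^m * s + noise`; whether CalMRL uses this exact form is not stated in this summary, and `loadings`, `noise_var`, and both function names are hypothetical. Only the posterior and imputation step is shown, not the alternating parameter update of the bi-step learner.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def posterior_shared_latent(z_obs, loadings, noise_var):
    """Gaussian posterior (mean, variance) of the scalar latent s given only
    the observed modalities. ASSUMED model: s ~ N(0, 1) and
    z^m | s ~ N(w^m * s, noise_var[m] * I)."""
    precision = 1.0   # contribution of the N(0, 1) prior
    weighted = 0.0
    for m, z in z_obs.items():        # missing modalities are simply absent
        w = loadings[m]
        precision += dot(w, w) / noise_var[m]
        weighted += dot(w, z) / noise_var[m]
    return weighted / precision, 1.0 / precision

def impute(missing, z_obs, loadings, noise_var):
    """Posterior-mean representation for each missing modality."""
    mean, _ = posterior_shared_latent(z_obs, loadings, noise_var)
    return {m: [wi * mean for wi in loadings[m]] for m in missing}

# Hypothetical parameters and observations; audio is missing for this instance.
loadings = {"vision": [1.0, 0.0], "text": [0.0, 1.0], "audio": [1.0, 1.0]}
noise_var = {"vision": 0.5, "text": 0.5, "audio": 0.5}
z_obs = {"vision": [2.0, 0.1], "text": [-0.1, 2.0]}

mean, var = posterior_shared_latent(z_obs, loadings, noise_var)
audio_hat = impute(["audio"], z_obs, loadings, noise_var)["audio"]
print(mean, var, audio_hat)
```

With no observed modalities the function returns the prior (mean 0, variance 1), and each additional observed modality tightens the posterior, so the same code path handles any missingness pattern in this simplified model.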

If this is right

  • Any current multimodal method gains the ability to train on datasets that contain samples with missing modalities.
  • The anchor shift is provably mitigated and the optimization converges under the supplied theoretical guidance.
  • Representation-level imputation replaces the need for modality-specific generative models.
  • The same calibration step can be inserted into contrastive, reconstruction, or fusion pipelines without redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-to-global anchor correction may apply to any alignment task that suffers from partial observations, such as sensor fusion with dropped channels.
  • If the priors used for imputation are themselves learned from complete data, the method could be iterated to bootstrap better priors on progressively more incomplete collections.
  • Testing whether the closed-form posterior remains stable when the number of missing modalities exceeds one would reveal the practical range of the bi-step solver.

Load-bearing premise

The deviation between the local anchor formed by observed modalities and the true global anchor can be corrected by imputing missing representations from priors and modality connections.

What would settle it

Running an existing multimodal method with and without CalMRL on the same dataset that contains missing modalities and finding no consistent gain in alignment quality or downstream accuracy.

Figures

Figures reproduced from arXiv: 2511.12034 by Jiaheng Wei, See-kiong Ng, Shuo Yang, Tat-Seng Chua, Xiaobo Xia, Xiaohao Liu, Xiu Su.

Figure 1
Figure 1: Missing modalities result in distorted representation alignment. Different modalities (in green) are aligned together with a virtual anchor (in red) implicitly with all modalities present. With missing modalities, observed ones are enforced to be aligned with a local anchor, deviating from the correct one, i.e., anchor shift. … thus ensuring an unbiased alignment. This introduces an inevitable challenge: colle…
Figure 2
Figure 2: The overall framework of CalMRL. Observed unimodal content is first encoded to corresponding representations $\{z^m\}_{m\in\Omega}$ with individual encoders $\phi^m$ in $\theta$. Despite the missing modalities (i.e., $M/\Omega$), CalMRL calibrates multimodal alignment whereby missing modalities are imputed by generative parameters $\hat{\theta}$. Finally, $\mathcal{L}_{\mathrm{rep}}$ optimizes the observed unimodal encoder to be aligned with the calibrated direction. … large…
Figure 5
Figure 5: The performance comparison across missing, calibrated, and full (“ideal”) modalities. All the models are trained on MSR-VTT. … vision-text alignment has benefited from extensive prior research and advancements, whereas audio-text alignment appears to have substantially greater room for improvement (corresponds to the visualization in …
Figure 6
Figure 6: t-SNE visualization on multimodal representations generated by different models. Existing models, under missing modality training, present clearly separated clusters for each modality (distinct modal boundaries). Fortunately, CalMRL mitigates this issue. …formance. These results provide strong evidence to shed light on how CalMRL works: calibrating the alignment in oracle to resist the degradation for missi…
Figure 7
Figure 7: Loss curves across models on the training phase. Training stability (R3). We plot the curves of the training loss $\mathcal{L}_{\mathrm{rep}}$ for different models in …
Original abstract

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that missing modalities induce an anchor shift in multimodal alignment, where observed modalities align to a suboptimal local anchor rather than the optimal full-modality anchor. It proposes CalMRL to correct this via representation-level imputation of missing modalities using priors and cross-modal connections, implemented through bi-step learning that yields a closed-form posterior over shared latents. Theoretical analysis is provided for anchor-shift mitigation and convergence, and experiments demonstrate that equipping existing advanced methods with CalMRL enables handling of incomplete data with superior performance.

Significance. If the closed-form posterior derivation and anchor-shift correction are shown to hold for arbitrary missing-modality patterns, the work would provide a principled mechanism to extend multimodal representation methods to the incomplete datasets common in practice. The public code release aids reproducibility and follow-up work.

major comments (2)
  1. [§3.2] §3.2 (bi-step learning and closed-form posterior): the marginalization yielding the closed-form posterior over shared latents is presented without explicit incorporation of a missingness mask or per-instance modality indicator; if the joint distribution is defined only over complete observations, the resulting calibration step does not demonstrably correct the local-anchor deviation under the general missing-modality regime asserted in the abstract and §1.
  2. [§4.3] §4.3, Table 2: the reported gains for CalMRL-augmented baselines lack standard deviations across runs or statistical significance tests, so it is unclear whether the improvements over the uncalibrated versions are reliable or could be explained by variance in the missing-modality simulation.
minor comments (2)
  1. [§2.1] §2.1: the distinction between the local anchor a_l and the optimal anchor a* is introduced informally; an explicit equation defining both quantities and the shift metric would clarify the subsequent theoretical claims.
  2. [Figure 3] Figure 3 caption: the visualization of imputed representations does not indicate the missing-modality rate or which modality is absent, reducing interpretability of the qualitative results.
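
For concreteness, here is one way a per-instance missingness mask of the kind requested in major comment 1 could enter a closed-form posterior. This is a sketch under a linear-Gaussian assumption that this review does not confirm the paper uses; the symbols $W^m$, $b^m$, $\sigma_m^2$, and the indicator $r_i^m$ are illustrative and not taken from the manuscript.

```latex
% ASSUMED generative model (illustrative only):
%   s_i ~ N(0, I),   z_i^m | s_i ~ N(W^m s_i + b^m, \sigma_m^2 I)
% r_i^m = 1 if modality m is observed for instance i, and 0 otherwise.
\begin{align}
\Sigma_i &= \Big( I + \sum_{m \in \mathcal{M}} \frac{r_i^m}{\sigma_m^2}\,
            (W^m)^\top W^m \Big)^{-1}, \\
\mu_i &= \Sigma_i \sum_{m \in \mathcal{M}} \frac{r_i^m}{\sigma_m^2}\,
            (W^m)^\top \big( z_i^m - b^m \big), \\
p\big( s_i \mid \{ z_i^m : r_i^m = 1 \} \big) &= \mathcal{N}(\mu_i, \Sigma_i).
\end{align}
```

When every $r_i^m = 1$ this reduces to the complete-data posterior, and when every $r_i^m = 0$ it falls back to the prior $\mathcal{N}(0, I)$, so the mask makes the dependence on the missingness pattern explicit.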

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and outline the revisions we plan to make.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (bi-step learning and closed-form posterior): the marginalization yielding the closed-form posterior over shared latents is presented without explicit incorporation of a missingness mask or per-instance modality indicator; if the joint distribution is defined only over complete observations, the resulting calibration step does not demonstrably correct the local-anchor deviation under the general missing-modality regime asserted in the abstract and §1.

    Authors: We appreciate this careful reading of §3.2. The closed-form posterior is derived by marginalizing over the shared latents in the bi-step optimization, where the first step imputes missing modality representations using modality priors and cross-modal connections. This imputation effectively conditions on the observed modalities for each instance, thereby addressing the anchor shift in the incomplete regime. The joint distribution in the theoretical analysis assumes complete observations to establish the existence of the optimal anchor, but the calibration procedure generalizes to missing patterns by replacing missing representations with their imputed counterparts before alignment. To clarify this, we will revise §3.2 to explicitly include a per-instance modality indicator and missingness mask in the posterior derivation and the optimization steps. revision: yes

  2. Referee: [§4.3] §4.3, Table 2: the reported gains for CalMRL-augmented baselines lack standard deviations across runs or statistical significance tests, so it is unclear whether the improvements over the uncalibrated versions are reliable or could be explained by variance in the missing-modality simulation.

    Authors: We acknowledge that the absence of standard deviations and significance testing in Table 2 limits the interpretability of the results. In the revised manuscript, we will rerun the experiments with multiple random seeds, report mean performance with standard deviations, and include statistical significance tests (e.g., paired t-tests) comparing CalMRL-augmented methods against their baselines to confirm that the observed gains are statistically reliable. revision: yes
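
The seed-level comparison promised in the second response can be sketched as follows. The scores are placeholders invented for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper built only on the Python standard library.

```python
import math
import statistics

def paired_t_statistic(baseline, calibrated):
    """Paired t-test statistic on per-seed scores.
    Returns (mean difference, sample std of differences, t, degrees of freedom)."""
    diffs = [c - b for b, c in zip(baseline, calibrated)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample std, divides by n - 1
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t, n - 1

# PLACEHOLDER scores for five seeds (same seeds, same missing-modality masks).
baseline = [60.0, 61.0, 59.0, 60.5, 59.5]
calibrated = [62.0, 62.5, 61.5, 62.0, 62.0]

mean_diff, sd_diff, t_stat, dof = paired_t_statistic(baseline, calibrated)
print(f"gain = {mean_diff:.2f} +/- {sd_diff:.2f}, t = {t_stat:.2f}, dof = {dof}")
```

Pairing by seed matters here because both variants then share the same simulated missingness pattern, which removes that source of variance from the comparison. The statistic would be compared against a t distribution with `dof` degrees of freedom (for example, the two-sided 5% critical value at 4 degrees of freedom is about 2.776); a library routine such as `scipy.stats.ttest_rel` returns the p-value directly.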

Circularity Check

0 steps flagged

No circularity: derivation self-contained via model assumptions and closed-form posterior

full rationale

The paper's chain starts from an anchor-shift observation, introduces priors and cross-modal connections as modeling assumptions, then derives a bi-step optimization whose closed-form posterior is obtained by marginalization under the stated generative model. No step reduces a claimed prediction or first-principles result to a fitted parameter or self-citation by construction. The calibration is presented as an application of the derived posterior rather than a renaming or re-use of the target quantity itself. External validation via experiments on missing-modality data supplies independent grounding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that modality priors and inherent connections exist and can be leveraged for accurate representation-level imputation, plus the premise that a bi-step procedure with closed-form posterior resolves the optimization dilemma without introducing new biases.

axioms (2)
  • domain assumption: Modalities possess inherent connections and priors that allow accurate imputation of missing representations.
    Invoked to model imputation for missing modalities at the representation level.
  • ad hoc to paper: The optimization dilemma caused by missing modalities can be resolved via bi-step learning with a closed-form posterior.
    Used to justify the training procedure and convergence claims.

pith-pipeline@v0.9.0 · 5527 in / 1445 out tokens · 28561 ms · 2026-05-17T22:05:11.858532+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 5 internal anchors

  1. [1]

    Multimodal deep learning

    Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y Ng, et al. Multimodal deep learning. InICML, volume 11, pages 689–696, 2011

  2. [2]

    A theory of multimodal learning.NeurIPS, 36:57244–57255, 2023

    Zhou Lu. A theory of multimodal learning.NeurIPS, 36:57244–57255, 2023

  3. [3]

    Multimodal learning with transformers: A survey.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023

    Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 45(10):12113–12132, 2023

  4. [4]

    Gen- eralized domain prompt learning for accessible scientific vision-language models.Nexus, 2(2), 2025

    Qinglong Cao, Yuntian Chen, Lu Lu, Hao Sun, Zhengzhong Zeng, Xiaokang Yang, and Dongxiao Zhang. Gen- eralized domain prompt learning for accessible scientific vision-language models.Nexus, 2(2), 2025

  5. [5]

    Multibench: Multiscale benchmarks for multimodal representation learning

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A Lee, Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. InNeurIPS, 2021

  6. [6]

    Quantifying & modeling multimodal interactions: An information decomposition framework

    Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, et al. Quantifying & modeling multimodal interactions: An information decomposition framework. InNeurIPS, pages 27351–27393, 2023

  7. [7]

    Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding

    Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding. arXiv preprint arXiv:2510.06308, 2025

  8. [8]

    Vit-lens: Towards omni-modal representations

    Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, and Mike Zheng Shou. Vit-lens: Towards omni-modal representations. InCVPR, pages 26647–26657, 2024

  9. [9]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  10. [10]

    Position: The platonic representation hypoth- esis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypoth- esis. InICML, 2024

  11. [11]

    Understanding the emergence of multi- modal representation alignment

    Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, and Paul Pu Liang. Understanding the emergence of multi- modal representation alignment. InICML, 2025

  12. [12]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021. 19 APREPRINT

  13. [13]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. InCVPR, pages 15180–15190, 2023

  14. [14]

    Vast: A vision- audio-subtitle-text omni-modality foundation model and dataset

    Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision- audio-subtitle-text omni-modality foundation model and dataset. InNeurIPS, pages 72842–72866, 2023

  15. [15]

    Gramian multimodal represen- tation learning and alignment

    Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, and Danilo Comminiello. Gramian multimodal represen- tation learning and alignment. InICLR, 2025

  16. [16]

    Continual multimodal contrastive learning

    Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, and Tat-Seng Chua. Continual multimodal contrastive learning. In NeurIPS, 2025

  17. [17]

    What to align in multimodal contrastive learning?ICLR, 2025

    Benoit Dufumier, Javiera Castillo-Navarro, Devis Tuia, and Jean-Philippe Thiran. What to align in multimodal contrastive learning?ICLR, 2025

  18. [18]

    Principled multimodal representation learning

    Xiaohao Liu, Xiaobo Xia, See-Kiong Ng, and Tat-Seng Chua. Principled multimodal representation learning. arXiv preprint arXiv:2507.17343, 2025

  19. [19]

    Wav2clip: Learning robust audio representations from clip

    Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. Wav2clip: Learning robust audio representations from clip. InICASSP, pages 4563–4567, 2022

  20. [20]

    Videoclip: Contrastive pre-training for zero-shot video-text understanding.arXiv preprint arXiv:2109.14084, 2021

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding.arXiv preprint arXiv:2109.14084, 2021

  21. [21]

    Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304, 2022

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304, 2022

  22. [22]

    Audioclip: Extending clip to image, text and audio

    Andrey Guzhov, Federico Raue, J ¨orn Hees, and Andreas Dengel. Audioclip: Extending clip to image, text and audio. InICASSP, pages 976–980, 2022

  23. [23]

    Pointclip: Point cloud understanding by clip

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. InCVPR, pages 8552–8562, 2022

  24. [24]

    A TRIANGLE enables multimodal alignment beyond cosine similarity

    Giordano Cicchetti, Eleonora Grassucci, and Danilo Comminiello. A TRIANGLE enables multimodal alignment beyond cosine similarity. InNeurIPS, 2025

  25. [25]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009

  26. [26]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. InICLR. OpenReview.net, 2024

  27. [27]

    Openvid-1m: A large-scale high-quality dataset for text-to-video generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. InICLR. OpenReview.net, 2025

  28. [28]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InACL, pages 119–132, 2019

  29. [29]

    Clotho: An audio captioning dataset

    Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: An audio captioning dataset. InICASSP, pages 736–740. IEEE, 2020

  30. [30]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. InICASSP, pages 1–5. IEEE, 2023

  31. [31]

    LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based se- mantic alignment.arXiv preprint arXiv:2310.01852, 2023. 20 APREPRINT

  32. [32]

    Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following.arXiv preprint arXiv:2309.00615, 2023

    Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following.arXiv preprint arXiv:2309.00615, 2023

  33. [33]

    Unibind: Llm-augmented unified and balanced repre- sentation space to bind them all

    Yuanhuiyi Lyu, Xu Zheng, Jiazhou Zhou, and Lin Wang. Unibind: Llm-augmented unified and balanced repre- sentation space to bind them all. InCVPR, pages 26752–26762, 2024

  34. [34]

    Omnibind: Large-scale omni multimodal representation via binding spaces.arXiv preprint arXiv:2407.11895, 2024

    Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, and Zhou Zhao. Omnibind: Large-scale omni multimodal representation via binding spaces.arXiv preprint arXiv:2407.11895, 2024

  35. [35]

    Self-supervised multimodal learning: A survey

    Yongshuo Zong, Oisin Mac Aodha, and Timothy Hospedales. Self-supervised multimodal learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  36. [36]

    Multi-modal contrastive masked autoencoders: A two-stage progressive pre-training approach for rgbd datasets

    Muhammad Abdullah Jamal and Omid Mohareri. Multi-modal contrastive masked autoencoders: A two-stage progressive pre-training approach for rgbd datasets. InCVPR, pages 17947–17957, 2025

  37. [37]

    Dpu: Dynamic prototype updating for multimodal out-of-distribution detection

    Shawn Li, Huixian Gong, Hao Dong, Tiankai Yang, Zhengzhong Tu, and Yue Zhao. Dpu: Dynamic prototype updating for multimodal out-of-distribution detection. InCVPR, pages 10193–10202, June 2025

  38. [38]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InICML, pages 1597–1607, 2020

  39. [39]

    Crossclr: Cross-modal contrastive learning for multi-modal video representations

    Mohammadreza Zolfaghari, Yi Zhu, Peter Gehler, and Thomas Brox. Crossclr: Cross-modal contrastive learning for multi-modal video representations. InICCV, pages 1450–1459, 2021

  40. [40]

    Cross-modal contrastive learning for text-to-image generation

    Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. InCVPR, pages 833–842, 2021

  41. [41]

    Few-shot adversarial prompt learning on vision-language models

    Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, and Tongliang Liu. Few-shot adversarial prompt learning on vision-language models. InNeurIPS, pages 3122–3156, 2024

  42. [42]

    Clap: learning audio concepts from natural language supervision

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap: learning audio concepts from natural language supervision. InICASSP, pages 1–5, 2023

  43. [43]

    Freebind: Free lunch in unified multimodal space via knowledge fusion.arXiv preprint arXiv:2405.04883, 2024

    Zehan Wang, Ziang Zhang, Xize Cheng, Rongjie Huang, Luping Liu, Zhenhui Ye, Haifeng Huang, Yang Zhao, Tao Jin, Peng Gao, et al. Freebind: Free lunch in unified multimodal space via knowledge fusion.arXiv preprint arXiv:2405.04883, 2024

  44. [44]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, G ¨ul Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InCVPR, pages 1728–1738, 2021

  45. [45]

    Learning audio-video modalities from image captions

    Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, and Cordelia Schmid. Learning audio-video modalities from image captions. InECCV, pages 407–426, 2022

  46. [46]

    Videoprism: A foundational visual encoder for video under- standing

    Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J Sun, Luke Fried- man, Rui Qian, Tobias Weyand, Yue Zhao, et al. Videoprism: A foundational visual encoder for video under- standing. InICML, pages 60785–60811. PMLR, 2024

  47. [47]

    Miradata: A large-scale video dataset with long durations and structured captions

    Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan. Miradata: A large-scale video dataset with long durations and structured captions. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, NeurIPS, 2024

  48. [48]

    A touch, vision, and language dataset for multimodal alignment

    Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, and Ken Goldberg. A touch, vision, and language dataset for multimodal alignment. InICML. OpenReview.net, 2024

  49. [49]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InICML, pages 12888–12900, 2022. 21 APREPRINT

  50. [50]

    Valor: Vision- audio-language omni-perception pretraining model and dataset.arXiv preprint arXiv:2304.08345, 2023

    Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, and Jing Liu. Valor: Vision- audio-language omni-perception pretraining model and dataset.arXiv preprint arXiv:2304.08345, 2023

  51. [51]

    Omnivec: Learning robust representations with cross modal sharing

    Siddharth Srivastava and Gaurav Sharma. Omnivec: Learning robust representations with cross modal sharing. InWACV, pages 1225–1237. IEEE, 2024

  52. [52]

    VIT-LENS: towards omni-modal representations

    Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, and Mike Zheng Shou. VIT-LENS: towards omni-modal representations. InCVPR, pages 26637–26647. IEEE, 2024

  53. [53]

    Onellm: One framework to align all modalities with language

    Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xiangyu Yue. Onellm: One framework to align all modalities with language. InCVPR, pages 26574–26585. IEEE, 2024

  54. [54]

    Deep Multimodal Learning with Missing Modality: A Survey

    Renjie Wu, Hu Wang, Hsiang-Ting Chen, and Gustavo Carneiro. Deep multimodal learning with missing modal- ity: A survey.arXiv preprint arXiv:2409.07825, 2024

  55. [55]

    Are multimodal transformers robust to missing modality? InCVPR, pages 18177–18186, 2022

    Mengmeng Ma, Jian Ren, Long Zhao, Davide Testuggine, and Xi Peng. Are multimodal transformers robust to missing modality? InCVPR, pages 18177–18186, 2022

  56. [56]

    Deep adversarial learning for multi-modality missing data completion

    Lei Cai, Zhengyang Wang, Hongyang Gao, Dinggang Shen, and Shuiwang Ji. Deep adversarial learning for multi-modality missing data completion. InKDD, pages 1158–1166, 2018

  57. [57]

    Incomplete multimodality-diffused emotion recognition

    Yuanzhi Wang, Yong Li, and Zhen Cui. Incomplete multimodality-diffused emotion recognition. InNeurIPS, pages 17117–17128, 2023

  58. [58]

    Missing modality imagination network for emotion recognition with uncertain missing modalities

    Jinming Zhao, Ruichen Li, and Qin Jin. Missing modality imagination network for emotion recognition with uncertain missing modalities. InACL, pages 2608–2618, 2021

  59. [59]

    Smil: Multimodal learning with severely missing modality

    Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. Smil: Multimodal learning with severely missing modality. InAAAI, volume 35, pages 2302–2310, 2021

  60. [60]

    M3care: Learning with missing modalities in multimodal healthcare data

    Chaohe Zhang, Xu Chu, Liantao Ma, Yinghao Zhu, Yasha Wang, Jiangtao Wang, and Junfeng Zhao. M3care: Learning with missing modalities in multimodal healthcare data. InKDD, pages 2418–2428, 2022

  61. [61]

    Multi-modal learning with missing modality via shared-specific feature modelling

    Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. Multi-modal learning with missing modality via shared-specific feature modelling. In CVPR, pages 15878–15887, 2023

  62. [62]

    Rethinking missing modality learning from a decoding perspective

    Tao Jin, Xize Cheng, Linjun Li, Wang Lin, Ye Wang, and Zhou Zhao. Rethinking missing modality learning from a decoding perspective. In MM, pages 4431–4439, 2023

  63. [63]

    Modal-nexus auto-encoder for multi-modality cellular data integration and imputation

    Zhenchao Tang, Guanxing Chen, Shouzhi Chen, Jianhua Yao, Linlin You, and Calvin Yu-Chian Chen. Modal-nexus auto-encoder for multi-modality cellular data integration and imputation. Nature Communications, 15(1):9021, 2024

  64. [64]

    Probabilistic conformal distillation for enhancing missing modality robustness

    Mengxi Chen, Fei Zhang, Zihua Zhao, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Probabilistic conformal distillation for enhancing missing modality robustness. In CVPR, pages 36218–36242, 2024

  65. [65]

    Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts

    Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts. In NeurIPS, pages 98782–98805, 2024

  66. [66]

    Knowledge bridger: Towards training-free missing modality completion

    Guanzhou Ke, Shengfeng He, Xiaoli Wang, Bo Wang, Guoqing Chao, Yuanyang Zhang, Yi Xie, and Hexing Su. Knowledge bridger: Towards training-free missing modality completion. In CVPR, pages 25864–25873, 2025

  67. [67]

    Boosting discriminability for robust multimodal entity linking with visual modality missing

    Mingrui Lao, Zheng Li, Yanming Guo, Xueyi Zhang, Siqi Cai, Zhaoyun Ding, and Haizhou Li. Boosting discriminability for robust multimodal entity linking with visual modality missing. In SIGIR, pages 989–999, 2025

  68. [68]

    Learnable cross-modal knowledge distillation for multi-modal learning with missing modality

    Hu Wang, Congbo Ma, Jianpeng Zhang, Yuan Zhang, Jodie Avery, Louise Hull, and Gustavo Carneiro. Learnable cross-modal knowledge distillation for multi-modal learning with missing modality. In MICCAI, pages 216–226, 2023

  69. [69]

    Leveraging knowledge of modality experts for incomplete multi-modal learning

    Wenxin Xu, Hexin Jiang, and Xuefeng Liang. Leveraging knowledge of modality experts for incomplete multi-modal learning. In MM, pages 438–446, 2024

  70. [70]

    Simmlm: A simple framework for multi-modal learning with missing modality

    Sijie Li, Chen Chen, and Jungong Han. Simmlm: A simple framework for multi-modal learning with missing modality. arXiv preprint arXiv:2507.19264, 2025

  71. [71]

    Robust multimodal learning with missing modalities via parameter-efficient adaptation

    Md Kaykobad Reza, Ashley Prater-Bennette, and M Salman Asif. Robust multimodal learning with missing modalities via parameter-efficient adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  72. [72]

    Probabilistic principal component analysis

    Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611–622, 1999

  73. [73]

    Factor analysis, probabilistic principal component analysis, variational inference, and variational autoencoder: Tutorial and survey

    Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Factor analysis, probabilistic principal component analysis, variational inference, and variational autoencoder: Tutorial and survey. arXiv preprint arXiv:2101.00734, 2021

  74. [74]

    Better together: Leveraging unpaired multimodal data for stronger unimodal models

    Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, and Phillip Isola. Better together: Leveraging unpaired multimodal data for stronger unimodal models. arXiv preprint arXiv:2510.08492, 2025

  75. [75]

    Collecting highly parallel data for paraphrase evaluation

    David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, pages 190–200, 2011

  76. [76]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, pages 5803–5812, 2017

  77. [77]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In ICCV, pages 706–715, 2017

  78. [78]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, pages 4581–4591, 2019

  79. [79]

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

  80. [80]

    Matrix perturbation theory

    Ren-Cang Li. Matrix perturbation theory. Handbook of linear algebra, pages 15–21, 2006

Showing first 80 references.