Calibrated Multimodal Representation Learning with Missing Modalities
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-17 22:05 UTC · model grok-4.3
The pith
Missing modalities cause an anchor shift in multimodal alignments that can be corrected by representation-level imputation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Incomplete alignments arise because observed modalities converge to a local anchor that differs from the global optimum available only when every modality is present; this deviation produces an irreducible shift that CalMRL removes by explicitly modeling imputation of the missing representations from priors and inherent modality connections, solved via bi-step learning that admits a closed-form posterior over the shared latents.
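The anchor shift can be illustrated numerically. The sketch below is not the paper's code. It assumes, following the Theorem 1 excerpt quoted further down this page, that the anchor is the leading left singular vector of a d × M matrix whose columns are the M modality representations of one instance. The synthetic data, the noise level, and the helper names `leading_left_singular_vector` and `anchor_shift` are illustrative assumptions.

```python
import numpy as np

def leading_left_singular_vector(Z):
    """Return the first left singular vector of a (d, M) matrix."""
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0]

def anchor_shift(Z, observed):
    """Distance between the anchor from all columns and from the observed ones.

    Z        : (d, M) array, one column per modality representation.
    observed : list of column indices of the modalities that are present.
    """
    u_full = leading_left_singular_vector(Z)
    u_obs = leading_left_singular_vector(Z[:, observed])
    # Singular vectors are defined up to sign, so align them before comparing.
    if u_full @ u_obs < 0:
        u_obs = -u_obs
    return float(np.linalg.norm(u_full - u_obs))

# Synthetic instance (assumption): a shared direction plus per-modality noise.
rng = np.random.default_rng(0)
d, M = 16, 4
shared = rng.normal(size=(d, 1))
Z = shared + 0.5 * rng.normal(size=(d, M))

shift_complete = anchor_shift(Z, [0, 1, 2, 3])  # every modality present
shift_partial = anchor_shift(Z, [0, 1])         # two modalities missing
print(f"shift with all modalities: {shift_complete:.3e}")
print(f"shift with two missing:    {shift_partial:.3e}")
```

With every modality present the two anchors coincide, so the shift is zero. Dropping columns moves the leading singular vector, and that displacement is what the calibration step aims to remove.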
What carries the argument
CalMRL's calibrated alignment, which performs representation-level imputation of missing modalities by combining priors with cross-modal connections inside a bi-step learner that has a closed-form solution for the posterior of shared latents.
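A closed-form posterior over shared latents is easiest to see in a linear-Gaussian setting. The sketch below is an assumption made for illustration, not the paper's derivation: each modality representation is modeled as z_m = W_m h + mu_m + noise, with h drawn from a standard normal prior and isotropic noise of variance `sigma2`, as in factor analysis or probabilistic PCA. The names `masked_posterior`, `impute_missing`, `W`, `mu`, and `sigma2` are hypothetical.

```python
import numpy as np

def masked_posterior(z, mask, W, mu, sigma2):
    """Posterior mean and covariance of h given only the observed modalities.

    z      : (M, d) modality representations (rows of missing ones are ignored).
    mask   : (M,) boolean, True where the modality is observed.
    W      : (M, d, k) per-modality loading matrices (assumed known here).
    mu     : (M, d) per-modality means.
    sigma2 : scalar noise variance.
    """
    mask = np.asarray(mask, dtype=bool)
    k = W.shape[2]
    precision = np.eye(k)          # prior precision of h ~ N(0, I)
    rhs = np.zeros(k)
    for m in np.flatnonzero(mask):  # only observed modalities contribute
        precision += W[m].T @ W[m] / sigma2
        rhs += W[m].T @ (z[m] - mu[m]) / sigma2
    cov = np.linalg.inv(precision)
    return cov @ rhs, cov

def impute_missing(z, mask, W, mu, sigma2):
    """Fill missing rows with their expected value under the posterior of h."""
    mask = np.asarray(mask, dtype=bool)
    mean, _ = masked_posterior(z, mask, W, mu, sigma2)
    z_hat = z.copy()
    for m in np.flatnonzero(~mask):
        z_hat[m] = W[m] @ mean + mu[m]
    return z_hat

# Synthetic example (assumption): 3 modalities, the third one missing.
rng = np.random.default_rng(0)
M, d, k, sigma2 = 3, 8, 2, 0.1
W = rng.normal(size=(M, d, k))
mu = np.zeros((M, d))
h_true = rng.normal(size=k)
z = np.einsum("mdk,k->md", W, h_true) + np.sqrt(sigma2) * rng.normal(size=(M, d))
mask = np.array([True, True, False])
z[~mask] = np.nan                  # the missing row is never read

post_mean, post_cov = masked_posterior(z, mask, W, mu, sigma2)
z_hat = impute_missing(z, mask, W, mu, sigma2)
```

The per-instance mask decides which modalities enter the precision matrix, so the same function handles any missingness pattern. With no observed modality the posterior falls back to the prior.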
If this is right
- Any current multimodal method gains the ability to train on datasets that contain samples with missing modalities.
- The anchor shift is provably mitigated and the optimization converges under the supplied theoretical guidance.
- Representation-level imputation replaces the need for modality-specific generative models.
- The same calibration step can be inserted into contrastive, reconstruction, or fusion pipelines without redesign.
Where Pith is reading between the lines
- The same local-to-global anchor correction may apply to any alignment task that suffers from partial observations, such as sensor fusion with dropped channels.
- If the priors used for imputation are themselves learned from complete data, the method could be iterated to bootstrap better priors on progressively more incomplete collections.
- Testing whether the closed-form posterior remains stable when the number of missing modalities exceeds one would reveal the practical range of the bi-step solver.
Load-bearing premise
The deviation between the local anchor formed by observed modalities and the true global anchor can be corrected by imputing missing representations from priors and modality connections.
What would settle it
Running an existing multimodal method with and without CalMRL on the same dataset that contains missing modalities and finding no consistent gain in alignment quality or downstream accuracy.
Original abstract
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that missing modalities induce an anchor shift in multimodal alignment, where observed modalities align to a suboptimal local anchor rather than the optimal full-modality anchor. It proposes CalMRL to correct this via representation-level imputation of missing modalities using priors and cross-modal connections, implemented through bi-step learning that yields a closed-form posterior over shared latents. Theoretical analysis is provided for anchor-shift mitigation and convergence, and experiments demonstrate that equipping existing advanced methods with CalMRL enables handling of incomplete data with superior performance.
Significance. If the closed-form posterior derivation and anchor-shift correction are shown to hold for arbitrary missing-modality patterns, the work would provide a principled mechanism to extend multimodal representation methods to the incomplete datasets common in practice. The public code release aids reproducibility and follow-up work.
major comments (2)
- [§3.2] Bi-step learning and closed-form posterior: the marginalization yielding the closed-form posterior over shared latents is presented without explicit incorporation of a missingness mask or per-instance modality indicator; if the joint distribution is defined only over complete observations, the resulting calibration step does not demonstrably correct the local-anchor deviation under the general missing-modality regime asserted in the abstract and §1.
- [§4.3] Table 2: the reported gains for CalMRL-augmented baselines lack standard deviations across runs or statistical significance tests, so it is unclear whether the improvements over the uncalibrated versions are reliable or could be explained by variance in the missing-modality simulation.
minor comments (2)
- [§2.1] The distinction between the local anchor a_l and the optimal anchor a* is introduced informally; an explicit equation defining both quantities and the shift metric would clarify the subsequent theoretical claims.
- [Figure 3] Caption: the visualization of imputed representations does not indicate the missing-modality rate or which modality is absent, reducing interpretability of the qualitative results.
Simulated Author's Rebuttal
We are grateful to the referee for the insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment below and outline the revisions we plan to make.
- Referee: [§3.2] Bi-step learning and closed-form posterior: the marginalization yielding the closed-form posterior over shared latents is presented without explicit incorporation of a missingness mask or per-instance modality indicator; if the joint distribution is defined only over complete observations, the resulting calibration step does not demonstrably correct the local-anchor deviation under the general missing-modality regime asserted in the abstract and §1.
  Authors: We appreciate this careful reading of §3.2. The closed-form posterior is derived by marginalizing over the shared latents in the bi-step optimization, where the first step imputes missing modality representations using modality priors and cross-modal connections. This imputation effectively conditions on the observed modalities for each instance, thereby addressing the anchor shift in the incomplete regime. The joint distribution in the theoretical analysis assumes complete observations to establish the existence of the optimal anchor, but the calibration procedure generalizes to missing patterns by replacing missing representations with their imputed counterparts before alignment. To clarify this, we will revise §3.2 to explicitly include a per-instance modality indicator and missingness mask in the posterior derivation and the optimization steps.
  Revision: yes
- Referee: [§4.3] Table 2: the reported gains for CalMRL-augmented baselines lack standard deviations across runs or statistical significance tests, so it is unclear whether the improvements over the uncalibrated versions are reliable or could be explained by variance in the missing-modality simulation.
  Authors: We acknowledge that the absence of standard deviations and significance testing in Table 2 limits the interpretability of the results. In the revised manuscript, we will rerun the experiments with multiple random seeds, report mean performance with standard deviations, and include statistical significance tests (e.g., paired t-tests) comparing CalMRL-augmented methods against their baselines to confirm that the observed gains are statistically reliable.
  Revision: yes
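The multi-seed comparison promised in the second response could be run along the following lines. This is a minimal sketch using only the Python standard library. The per-seed scores are made-up placeholder numbers, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics

def paired_t_statistic(baseline, calibrated):
    """Paired t-test statistic for per-seed scores of two methods.

    Both lists must hold one score per random seed, in the same seed order.
    Returns (mean difference, std of differences, t statistic, degrees of freedom).
    """
    if len(baseline) != len(calibrated) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least 2 seeds")
    diffs = [c - b for b, c in zip(baseline, calibrated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat, len(diffs) - 1

# Placeholder scores for 5 seeds (assumption, NOT taken from the paper).
baseline_scores = [70.0, 71.0, 69.0, 70.0, 70.0]
calibrated_scores = [71.0, 73.0, 70.0, 72.0, 71.0]

mean_diff, sd_diff, t_stat, dof = paired_t_statistic(baseline_scores, calibrated_scores)
print(f"mean gain {mean_diff:.2f} ± {sd_diff:.2f}, t = {t_stat:.2f}, dof = {dof}")
```

The t statistic is then compared against a Student t distribution with `dof` degrees of freedom to obtain a p-value. Pairing by seed matters here because both methods see the same simulated missing-modality pattern under a given seed.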
Circularity Check
No circularity: derivation self-contained via model assumptions and closed-form posterior
The paper's chain starts from an anchor-shift observation, introduces priors and cross-modal connections as modeling assumptions, then derives a bi-step optimization whose closed-form posterior is obtained by marginalization under the stated generative model. No step reduces a claimed prediction or first-principles result to a fitted parameter or self-citation by construction. The calibration is presented as an application of the derived posterior rather than a renaming or re-use of the target quantity itself. External validation via experiments on missing-modality data supplies independent grounding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Modalities possess inherent connections and priors that allow accurate imputation of missing representations.
- ad hoc to paper: The optimization dilemma caused by missing modalities can be resolved via bi-step learning with closed-form posterior.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Theorem 1 (Anchor shift under incomplete modality alignment). Let $u_1$ and $u_{\Omega,1}$ be the leading left singular vectors... $\sqrt{2\left(1 - \frac{\sigma_{\Omega,1} + \eta^2}{\sigma_1}\right)} \le \lVert u_1 - u_{\Omega,1} \rVert \le \frac{\sqrt{2}\,\lVert Z_{\bar{\Omega}} \rVert_2}{\sigma_1 - \sigma_2}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.