Latent Denoising Improves Visual Alignment in Large Multimodal Models
Pith reviewed 2026-05-09 23:04 UTC · model grok-4.3
The pith
Training large multimodal models to recover clean visual patch features from corrupted tokens improves internal alignment and robustness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large Multimodal Models trained with an autoregressive language modeling objective receive only indirect supervision on visual tokens, resulting in weak internal representations. We propose a latent denoising framework that applies a saliency-aware mixture of masking and Gaussian noising to projected visual tokens and trains the model to recover clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. Intra-image similarity is preserved through contrastive patch distillation to avoid collapse. During inference the corruption and auxiliary components are removed, adding no overhead.
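The corruption stage of this claim can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the authors' code: the function name, the sample-in-proportion-to-saliency rule, and the even mask-vs-noise split are assumptions.

```python
import numpy as np

def corrupt_visual_tokens(tokens, saliency, corrupt_ratio=0.25, noise_std=0.5, seed=0):
    """Saliency-aware mixture of masking and Gaussian noising (sketch).

    tokens:   (N, D) projected visual patch tokens fed to the LLM
    saliency: (N,) nonnegative per-patch saliency scores; more salient
              patches are corrupted more often, so the denoising target
              concentrates supervision on informative regions.
    Returns the corrupted tokens and the indices that were corrupted.
    """
    rng = np.random.default_rng(seed)
    probs = saliency / saliency.sum()          # sample in proportion to saliency
    n_corrupt = int(corrupt_ratio * len(tokens))
    idx = rng.choice(len(tokens), size=n_corrupt, replace=False, p=probs)

    corrupted = tokens.copy()
    half = n_corrupt // 2
    corrupted[idx[:half]] = 0.0                # hard masking
    corrupted[idx[half:]] += rng.normal(       # additive Gaussian noising
        0.0, noise_std, size=(n_corrupt - half, tokens.shape[1]))
    return corrupted, idx
```

At training time the corrupted tokens replace the clean ones at the LLM input; at inference the function is simply never called, which is why no test-time overhead is added.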
What carries the argument
A decoder that recovers clean teacher patch features from an intermediate LLM layer while the training objective also enforces preservation of the teacher's intra-image similarity structure through contrastive distillation.
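A minimal NumPy sketch of the two training signals named here. The InfoNCE-style form of the contrastive term and the temperature are assumptions made for illustration; the full objective in the paper also includes the standard autoregressive language-modeling loss.

```python
import numpy as np

def denoising_loss(decoded, teacher):
    """MSE between the decoder's output (read off an intermediate LLM
    layer's hidden states) and the clean teacher patch features."""
    return float(np.mean((decoded - teacher) ** 2))

def intra_image_contrastive_loss(student, teacher, temperature=0.1):
    """Intra-image contrastive patch distillation (illustrative InfoNCE
    form): each student patch should match its own teacher patch more
    closely than any other patch of the same image, preserving the
    teacher's intra-image similarity structure and resisting collapse."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature                   # (N, N) patch similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # positives on the diagonal
```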
If this is right
- Consistent gains appear on standard multimodal benchmarks for visual understanding and reasoning.
- Clear improvements occur on compositional robustness benchmarks such as NaturalBench.
- Higher accuracy is maintained and degradation is reduced when non-adversarial common corruptions are applied to benchmark images at both moderate and severe levels.
- Inference remains unchanged because corruption and auxiliary heads are disabled at test time.
Where Pith is reading between the lines
- The same denoising principle could be tested on other multimodal architectures that rely on projected visual tokens.
- Combining the approach with existing alignment losses might produce additive robustness gains.
- The framework suggests that intermediate-layer reconstruction targets can serve as a general tool for strengthening cross-modal representations under distribution shift.
- Similar corruption and recovery objectives might be explored for language-only or other modality-specific alignment tasks.
Load-bearing premise
Recovering clean teacher patch features from an intermediate LLM layer via denoising while preserving intra-image similarity produces better-aligned internal visual representations without introducing new failure modes or representation collapse.
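One standard diagnostic for the "no representation collapse" half of this premise is the effective rank of the patch-feature matrix (Roy and Vetterli's entropy-of-spectrum measure). The paper's framing invites such a check but does not mandate this particular measure; the sketch below is an assumption about how one would test it.

```python
import numpy as np

def effective_rank(features):
    """Effective rank: exponential of the entropy of the normalized
    singular-value spectrum. Near-identical patch features (collapse)
    concentrate the spectrum and push this toward 1; diverse features
    keep it close to the ambient dimension."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]                       # drop numerically zero modes
    return float(np.exp(-(p * np.log(p)).sum()))
```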
What would settle it
A controlled experiment in which the denoising training produces no accuracy gain or causes worse degradation on NaturalBench and ImageNet-C-style corrupted versions of standard multimodal benchmarks would falsify the central claim.
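The corruption side of such a falsification test is straightforward to script. Below is one ImageNet-C-style corruption as an illustration; the sigma schedule is chosen for demonstration and is not the benchmark's published constants (ImageNet-C spans 15 corruption types at 5 severities).

```python
import numpy as np

def gaussian_noise_corruption(image, severity=3, seed=0):
    """One ImageNet-C-style common corruption (sketch): additive Gaussian
    noise whose strength grows with severity in {1, ..., 5}.

    image: float array in [0, 1], shape (H, W, C).
    Note: the sigma schedule below is illustrative, not ImageNet-C's
    exact constants.
    """
    sigma = (0.04, 0.08, 0.12, 0.18, 0.26)[severity - 1]
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Running each benchmark image through such a corruption at moderate (severity 3) and severe (severity 5) levels, then comparing accuracy degradation with and without denoising training, is the shape of the proposed settling experiment.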
Original abstract
Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a latent denoising framework for Large Multimodal Models (LMMs) to enhance visual token alignment. Projected visual tokens are corrupted via saliency-aware masking and Gaussian noise. The model is trained to recover clean teacher patch features from an intermediate LLM layer's hidden states using a decoder, supplemented by intra-image contrastive patch distillation to avoid collapse. The auxiliary losses are combined with the standard autoregressive objective, and all additions are disabled at inference. Empirical results show consistent improvements on multimodal benchmarks, gains on NaturalBench for compositional robustness, and better performance under ImageNet-C corruptions.
Significance. If the empirical results hold, this work is significant as it provides a practical, inference-free method to directly supervise visual representations in LMMs, which are otherwise only indirectly trained via language modeling. The approach leverages ideas from latent denoising in visual tokenizers and demonstrates benefits in both standard and robustness settings. The public release of code is a notable strength for reproducibility and further research.
Major comments (2)
- §4 (Experiments): The reported consistent gains on standard benchmarks and NaturalBench are promising, but the section lacks details on the number of runs, statistical tests for significance, or variance in results, all of which are needed to confirm the robustness of the improvements over strong baselines.
- §3.1 (Method): The choice of intermediate LLM layer from which hidden states are extracted to recover teacher features is not justified with ablations; different layers may yield different alignment quality, potentially affecting the central claim of improved visual representations.
Minor comments (2)
- Abstract: The abstract mentions 'strong baselines' but does not specify which ones; this should be clarified for readers.
- §5 (Conclusion): Ensure all hyperparameters and training details are listed in the appendix for full reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the constructive comments on experimental robustness and methodological justification. We address each major comment below and commit to appropriate revisions.
Point-by-point responses
- Referee [§4 (Experiments)]: The reported consistent gains on standard benchmarks and NaturalBench are promising, but the section lacks details on the number of runs, statistical tests for significance, or variance in results, all of which are needed to confirm the robustness of the improvements over strong baselines.
  Authors: We agree that additional detail on experimental variability would strengthen the presentation. The main results in the submitted manuscript were obtained from single training runs per configuration due to the high computational cost of LMM training; however, we performed limited multi-seed checks during development and observed stable gains. In the revised manuscript, we will add a dedicated paragraph in §4 reporting results from three independent runs (different random seeds) for the primary benchmarks, including means and standard deviations. We will also include paired t-test p-values comparing our method to the strongest baselines to quantify significance. Revision: yes.
- Referee [§3.1 (Method)]: The choice of intermediate LLM layer from which hidden states are extracted to recover teacher features is not justified with ablations; different layers may yield different alignment quality, potentially affecting the central claim of improved visual representations.
  Authors: The intermediate layer (layer 16 of the 32-layer LLM) was selected because mid-depth layers typically encode a useful combination of visual semantics and emerging linguistic structure, consistent with layer-wise probing studies in the LLM literature. We did not include a full ablation study in the original submission. For the revision, we will add an appendix ablation comparing layers 8, 16, 24, and 32 on a subset of benchmarks, showing that layer 16 provides the best alignment quality and downstream performance without inducing collapse. Revision: yes.
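The committed ablation amounts to sweeping the tap point at which the denoising decoder reads hidden states. Assuming a model that returns per-layer hidden states (as HuggingFace-style `output_hidden_states=True` does), the selection step is small; the function name and the slice convention below are illustrative assumptions.

```python
import numpy as np

def tap_hidden_states(hidden_states, layer, visual_token_slice):
    """Select the visual-token hidden states at one intermediate layer.

    hidden_states: list of (T, D) arrays, one per LLM layer (index 0 =
                   input embeddings), as returned by HuggingFace-style
                   APIs with output_hidden_states=True.
    layer:         layer to tap, e.g. 8, 16, 24, or 32 in the ablation.
    visual_token_slice: positions of the projected visual tokens within
                   the full token sequence.
    """
    return hidden_states[layer][visual_token_slice]
```

An ablation would loop `for layer in (8, 16, 24, 32):`, feed the tapped states through the denoising decoder, and compare downstream accuracy.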
Circularity Check
No significant circularity
Full rationale
The paper describes an empirical auxiliary training procedure (saliency-aware corruption of projected visual tokens followed by decoder-based recovery of teacher patch features plus intra-image contrastive distillation) whose claimed benefits are measured directly on external multimodal benchmarks and corruption suites. No equations, uniqueness theorems, or self-citations are invoked to derive the performance gains; the method is a concrete recipe whose outputs are not forced by construction from its own fitted parameters or prior author results. The central claim therefore remains independently testable and does not reduce to a renaming or self-referential fit.