Let ViT Speak: Generative Language-Image Pre-training
Pith reviewed 2026-05-09 18:53 UTC · model grok-4.3
The pith
A ViT can learn to generate language tokens from visual tokens using only a language modeling objective, aligning it with autoregressive LLMs without contrastive batches or a separate text decoder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenLIP trains a single ViT to jointly process visual and textual tokens by directly predicting language tokens from visual tokens with a language modeling objective. This minimalist design replaces contrastive loss and extra decoders while still delivering competitive or better results on diverse MLLM benchmarks when trained on 8B samples, with additional gains on detail-sensitive tasks after multi-resolution continued pretraining.
What carries the argument
The generative language modeling objective that trains the ViT to output language tokens conditioned only on visual tokens inside one shared transformer.
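The carrier above can be made concrete: one forward pass over the concatenated [visual tokens | text tokens] sequence, with next-token cross-entropy applied only where the target is a language token, so visual positions receive no direct label. A minimal numpy sketch (the function name and masking convention are our illustration, not the paper's code):

```python
import numpy as np

def genlip_style_loss(logits, targets, num_visual):
    """Next-token LM loss computed only at text positions.

    logits:  (seq_len, vocab) scores from one shared transformer run on
             the concatenated [visual tokens | text tokens] sequence.
    targets: (seq_len,) token ids; visual positions carry a dummy id.
    num_visual: number of visual tokens at the front of the sequence.
    """
    # Log-softmax over the vocabulary at each position.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Standard next-token shift: position t predicts token t+1.
    nll = -log_probs[np.arange(len(targets) - 1), targets[1:]]
    # Mask: only positions whose *target* is a text token contribute,
    # so the ViT is supervised purely by language modeling on captions.
    text_mask = np.arange(1, len(targets)) >= num_visual
    return (nll * text_mask).sum() / text_mask.sum()
```

A contrastive pipeline would additionally need in-batch negatives; here the supervision is just the caption's own tokens.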
If this is right
- A single transformer suffices to model both modalities together.
- The approach scales with both data volume and model size.
- It matches or exceeds strong baselines on multimodal tasks despite using substantially less pretraining data than competitors.
- Continued pretraining at native aspect ratios and multiple resolutions improves results on OCR and chart-understanding tasks.
- The encoder can be used directly inside autoregressive MLLMs without extra alignment stages.
Where Pith is reading between the lines
- Native-aspect-ratio training may preserve spatial relationships that standard square resizing discards, explaining the OCR gains.
- Removing contrastive batch construction could simplify large-scale data pipelines and reduce memory requirements during pretraining.
- The same generative objective might extend to video or other temporally ordered visual inputs with minimal architectural change.
- Because the vision encoder is already trained to emit language-like tokens, downstream instruction tuning of the full MLLM could converge faster.
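The native-aspect-ratio speculation above can be phrased as a token-budget heuristic: keep the patch grid proportional to the image instead of squashing it to a square. The rule below is our illustrative sketch under that assumption, not the paper's documented resizing procedure:

```python
import math

def native_resolution_grid(height, width, patch=14, max_tokens=1024):
    """Pick a ViT patch grid that preserves the image's native aspect
    ratio while staying under a visual-token budget (illustrative
    heuristic; the exact resizing rule is not specified here)."""
    # Grid that covers the image at native resolution.
    gh, gw = math.ceil(height / patch), math.ceil(width / patch)
    # If over budget, shrink both axes by the same factor so the
    # aspect ratio (and hence relative spatial layout) is preserved.
    if gh * gw > max_tokens:
        scale = math.sqrt(max_tokens / (gh * gw))
        gh, gw = max(1, int(gh * scale)), max(1, int(gw * scale))
    return gh, gw
```

Square resizing would force `gh == gw` regardless of the input, which is exactly the distortion the OCR-gain hypothesis blames.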
Load-bearing premise
That a pure language modeling loss on visual-to-text token prediction will produce vision features aligned well enough for autoregressive LLMs without any contrastive signal or separate text tower.
What would settle it
Plugging the GenLIP-trained ViT and a contrastively trained ViT into identical MLLM architectures and finding that the contrastive version scores substantially higher on the same suite of multimodal benchmarks would falsify the performance claim.
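Operationally, that settling experiment reduces to a paired comparison over a shared benchmark suite. The decision rule below is an illustrative sketch (the `min_gap` threshold and the majority criterion are our assumptions, not a published protocol):

```python
def paired_outcome(scores_gen, scores_con, min_gap=1.0):
    """Decide the falsification test: both encoders sit in identical
    MLLM stacks and are scored on the same benchmarks. 'contrastive'
    is returned only if it wins by at least `min_gap` points on
    average AND on a majority of benchmarks (thresholds illustrative)."""
    deltas = [c - g for g, c in zip(scores_gen, scores_con)]
    mean_gap = sum(deltas) / len(deltas)
    contrastive_wins = sum(d > 0 for d in deltas)
    if mean_gap >= min_gap and contrastive_wins > len(deltas) / 2:
        return "contrastive"  # would falsify the performance claim
    if -mean_gap >= min_gap and contrastive_wins < len(deltas) / 2:
        return "generative"
    return "no substantial difference"
```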
read the original abstract
In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GenLIP, a minimalist generative pre-training framework for Vision Transformers (ViTs) aimed at multimodal large language models (MLLMs). It trains a ViT to predict language tokens directly from visual tokens using only a standard language modeling objective, without contrastive batch construction or an additional text decoder. The approach is claimed to offer simplicity (a single transformer for visual and textual tokens), scalability with data and model size, and competitive or superior performance on diverse multimodal benchmarks. Specifically, when trained on 8B samples from Recap-DataComp-1B, it matches or surpasses strong baselines despite using less data; continued pretraining on multi-resolution images at native aspect ratios further improves results on detail-sensitive tasks such as OCR and chart understanding.
Significance. If the results hold under rigorous controls, the work would be significant for demonstrating that a pure generative language-modeling objective can produce vision encoders that align effectively with autoregressive LLMs, thereby simplifying pretraining pipelines that currently rely on contrastive losses. The reported ability to achieve strong performance with substantially less data and to improve on fine-grained tasks via multi-resolution continued pretraining would be a practical contribution to MLLM vision-encoder design.
major comments (2)
- [Abstract and Experiments] The central claim that the generative LM objective (rather than data quality) produces effective alignment is load-bearing, yet the manuscript provides no controlled ablation that holds the Recap-DataComp-1B dataset fixed while swapping the objective for a contrastive baseline. Without this comparison, gains over prior methods could be driven by the high-quality recaptions rather than by the minimalist design.
- [Experiments] The abstract states competitive results on diverse benchmarks but supplies no details on experimental setup, exact baselines and their data volumes, error bars, or data-exclusion criteria. This absence prevents verification of the claim that GenLIP matches or surpasses strong baselines with substantially less pretraining data.
minor comments (2)
- [Abstract] The phrase '8B samples from Recap-DataComp-1B' would benefit from explicit clarification of whether this constitutes multiple passes over the full 1B-sample dataset or a curated subset.
- [Introduction] The manuscript would be strengthened by citing prior generative vision-language pretraining works that also avoid contrastive losses, to better situate the novelty of the single-transformer design.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that additional controls and details are needed to strengthen the claims and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract and Experiments] The central claim that the generative LM objective (rather than data quality) produces effective alignment is load-bearing, yet the manuscript provides no controlled ablation that holds the Recap-DataComp-1B dataset fixed while swapping the objective for a contrastive baseline. Without this comparison, gains over prior methods could be driven by the high-quality recaptions rather than by the minimalist design.
Authors: We acknowledge that a controlled ablation holding the Recap-DataComp-1B dataset fixed while comparing the generative LM objective to a contrastive baseline would more rigorously isolate the contribution of the objective. In the revised manuscript, we will add this experiment by training a contrastive baseline on the identical dataset and reporting comparative results on the multimodal benchmarks. This will help substantiate that the performance gains are attributable to the generative approach rather than solely to data quality. revision: yes
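For concreteness, the objective swap the authors commit to would replace the language modeling loss with a CLIP-style symmetric InfoNCE on the same captions, isolating the objective from the data. A hedged numpy sketch of that baseline loss (hyperparameters are ours, not the paper's):

```python
import numpy as np

def clip_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix: the
    contrastive baseline the proposed ablation would train on the
    identical dataset. Illustrative sketch, not a full pipeline."""
    # L2-normalize both embedding sets.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # in-batch similarity matrix
    n = logits.shape[0]

    def xent(lg):
        # Cross-entropy with the matching pair on the diagonal.
        z = lg - lg.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Note what the generative objective drops: this loss requires every other caption in the batch as a negative, which is the batch-construction machinery GenLIP removes.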
-
Referee: [Experiments] The abstract states competitive results on diverse benchmarks but supplies no details on experimental setup, exact baselines and their data volumes, error bars, or data-exclusion criteria. This absence prevents verification of the claim that GenLIP matches or surpasses strong baselines with substantially less pretraining data.
Authors: We agree that the current manuscript lacks sufficient experimental details for full verification. The revised version will expand the Experiments section to include a complete description of the setup, exact baselines with their pretraining data volumes, error bars from repeated runs where feasible, and any data exclusion criteria. We will also ensure the abstract references these details for clarity. revision: yes
Circularity Check
No circularity detected: GenLIP is an empirical generative pretraining framework with no load-bearing derivations or self-referential reductions.
full rationale
The paper describes GenLIP as a direct application of standard language modeling to train a ViT to predict language tokens from visual tokens, without contrastive losses or extra decoders. No equations, uniqueness theorems, or parameter-fitting steps are presented that reduce claimed performance or alignment properties to inputs by construction. Results are reported from training on Recap-DataComp-1B and continued pretraining, with comparisons to baselines. The approach relies on experimental validation rather than tautological definitions or self-citation chains that would force the outcome. This is a standard empirical ML paper with self-contained claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: a single transformer can jointly model visual and textual tokens under a language modeling objective