pith. machine review for the scientific record.

arxiv: 2605.00809 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

Let ViT Speak: Generative Language-Image Pre-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords Generative pretraining · Vision Transformer · Multimodal large language models · Language-image alignment · Autoregressive modeling · ViT encoder · OCR and chart understanding

The pith

A ViT can learn to generate language tokens from visual tokens using only a language modeling objective, aligning it with autoregressive LLMs without contrastive batches or a separate text decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GenLIP pretrains Vision Transformers by feeding visual tokens into the model and training it to predict the corresponding language tokens under a standard next-token language modeling loss. This produces a vision encoder that works directly inside multimodal LLMs without the usual contrastive pretraining stage or an auxiliary text decoder. The resulting model reaches or exceeds strong baselines on multimodal benchmarks after training on 8 billion samples from Recap-DataComp-1B, and a second stage of continued pretraining on native-resolution images lifts performance further on tasks that require fine visual detail, such as OCR and chart reading.
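A minimal sketch of how that objective can be wired up in PyTorch, assuming flattened image patches, a toy single-transformer stack, and a fully causal attention mask (all names, sizes, and those choices are illustrative assumptions, not the paper's actual architecture): visual tokens and caption tokens share one sequence, and the next-token cross-entropy is computed only on the caption positions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToySingleTransformerVLM(nn.Module):
        """One transformer over a joint visual + text token sequence (illustrative)."""
        def __init__(self, vocab_size=32000, dim=512, patch_dim=3 * 16 * 16, layers=4, heads=8):
            super().__init__()
            self.patch_embed = nn.Linear(patch_dim, dim)     # flattened patches -> visual tokens
            self.text_embed = nn.Embedding(vocab_size, dim)  # caption ids -> text tokens
            block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.backbone = nn.TransformerEncoder(block, layers)
            self.lm_head = nn.Linear(dim, vocab_size)

        def forward(self, patches, text_ids):
            # patches: (B, N_img, patch_dim); text_ids: (B, N_txt)
            seq = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
            n = seq.size(1)
            causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=seq.device), 1)
            return self.lm_head(self.backbone(seq, mask=causal))

    def generative_pretraining_loss(model, patches, text_ids):
        # Next-token cross-entropy on caption tokens only, conditioned on the image:
        # the last visual position predicts the first caption token, and so on.
        logits = model(patches, text_ids)
        n_img = patches.size(1)
        pred = logits[:, n_img - 1 : n_img - 1 + text_ids.size(1)]
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), text_ids.reshape(-1))

How positional information is injected, and whether the visual tokens attend bidirectionally among themselves or causally as here, are details the abstract does not pin down; the sketch only illustrates that one loss and one stack can carry both modalities.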

Core claim

GenLIP trains a single ViT to jointly process visual and textual tokens by directly predicting language tokens from visual tokens with a language modeling objective. This minimalist design replaces contrastive loss and extra decoders while still delivering competitive or better results on diverse MLLM benchmarks when trained on 8B samples, with additional gains on detail-sensitive tasks after multi-resolution continued pretraining.

What carries the argument

The generative language modeling objective that trains the ViT to output language tokens conditioned only on visual tokens inside one shared transformer.

If this is right

  • A single transformer suffices to model both modalities together.
  • The approach scales with both data volume and model size.
  • It matches or exceeds strong baselines on multimodal tasks despite using substantially less pretraining data than competitors.
  • Continued pretraining at native aspect ratios and multiple resolutions improves results on OCR and chart-understanding tasks.
  • The encoder can be used directly inside autoregressive MLLMs without extra alignment stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Native-aspect-ratio training may preserve spatial relationships that standard square resizing discards, explaining the OCR gains (a preprocessing sketch follows this list).
  • Removing contrastive batch construction could simplify large-scale data pipelines and reduce memory requirements during pretraining.
  • The same generative objective might extend to video or other temporally ordered visual inputs with minimal architectural change.
  • Because the vision encoder is already trained to emit language-like tokens, downstream instruction tuning of the full MLLM could converge faster.
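On the native-aspect-ratio point above, here is a sketch of one plausible preprocessing step; the patch size, the token cap, and the PIL-based resizing are assumptions for illustration, not the paper's pipeline. Each image is resized to the nearest patch-size multiples at its original aspect ratio, so the number of visual tokens varies per image instead of every image being squashed to a fixed square.

    from PIL import Image

    PATCH = 16  # assumed ViT patch size

    def native_resolution_resize(img: Image.Image, max_patches: int = 1024):
        """Resize to patch-size multiples while keeping the native aspect ratio.

        A tall document page stays tall and simply yields more (or fewer) patch
        tokens, capped roughly at max_patches, rather than being squashed square.
        """
        w, h = img.size
        scale = min(1.0, (max_patches * PATCH * PATCH / (w * h)) ** 0.5)
        new_w = max(PATCH, round(w * scale / PATCH) * PATCH)
        new_h = max(PATCH, round(h * scale / PATCH) * PATCH)
        resized = img.resize((new_w, new_h))
        return resized, (new_w // PATCH) * (new_h // PATCH)  # image, visual-token count

Because rows of text and chart gridlines are neither stretched nor compressed, character shapes keep their proportions, which is exactly the kind of detail the OCR and chart benchmarks reward.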

Load-bearing premise

That a pure language modeling loss on visual-to-text token prediction will produce vision features aligned well enough for autoregressive LLMs without any contrastive signal or separate text tower.
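To make the premise concrete, here is roughly what the two signals look like side by side; this is a sketch under standard conventions, the InfoNCE form is the usual CLIP-style loss the paper says it drops, and neither snippet is the authors' code. The contrastive loss needs every other caption in the batch as a negative, while the generative loss is a per-sample next-token cross-entropy that does not depend on batch composition.

    import torch
    import torch.nn.functional as F

    def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # Symmetric InfoNCE over pooled embeddings: every other pair in the
        # batch acts as a negative, so the signal scales with batch size.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def generative_loss(text_logits, text_ids):
        # Per-sample next-token loss on image-conditioned caption predictions
        # (as in the earlier sketch); no in-batch negatives are required.
        return F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)), text_ids.reshape(-1))

The practical difference the premise rides on is visible here: the contrastive path needs careful batch construction and a large effective batch; the generative path does not; the open question is whether the features it produces align as well.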

What would settle it

Plugging the GenLIP-trained ViT and a contrastively trained ViT into identical MLLM architectures and finding that the contrastive version scores substantially higher on the same suite of multimodal benchmarks would falsify the performance claim.

Original abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GenLIP, a minimalist generative pre-training framework for Vision Transformers (ViTs) aimed at multimodal large language models (MLLMs). It trains a ViT to predict language tokens directly from visual tokens using only a standard language modeling objective, without contrastive batch construction or an additional text decoder. The approach is claimed to offer simplicity (single transformer for visual and textual tokens), scalability with data and model size, and competitive or superior performance on diverse multimodal benchmarks. Specifically, when trained on 8B samples from Recap-DataComp-1B it matches or surpasses strong baselines despite using less data; continued pretraining on multi-resolution images at native aspect ratios further improves results on detail-sensitive tasks such as OCR and chart understanding.

Significance. If the results hold under rigorous controls, the work would be significant for demonstrating that a pure generative language-modeling objective can produce vision encoders that align effectively with autoregressive LLMs, thereby simplifying pretraining pipelines that currently rely on contrastive losses. The reported ability to achieve strong performance with substantially less data and to improve on fine-grained tasks via multi-resolution continued pretraining would be a practical contribution to MLLM vision-encoder design.

major comments (2)
  1. [Abstract and Experiments section] The central claim that the generative LM objective (rather than data quality) produces effective alignment is load-bearing, yet the manuscript provides no controlled ablation that holds the Recap-DataComp-1B dataset fixed while swapping the objective for a contrastive baseline. Without this comparison, gains over prior methods could be driven by the high-quality recaptions rather than the minimalist design.
  2. [Experiments section] The abstract states competitive results on diverse benchmarks but supplies no details on experimental setup, exact baselines and their data volumes, error bars, or data exclusion criteria. This absence prevents verification of the claim that GenLIP matches or surpasses strong baselines with substantially less pretraining data.
minor comments (2)
  1. [Abstract] The phrase '8B samples from Recap-DataComp-1B' would benefit from explicit clarification of whether this constitutes the full dataset or a curated subset.
  2. [Introduction] The manuscript would be strengthened by citing prior generative vision-language pretraining works that also avoid contrastive losses, to better situate the novelty of the single-transformer design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that additional controls and details are needed to strengthen the claims and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The central claim that the generative LM objective (rather than data quality) produces effective alignment is load-bearing, yet the manuscript provides no controlled ablation that holds the Recap-DataComp-1B dataset fixed while swapping the objective for a contrastive baseline. Without this comparison, gains over prior methods could be driven by the high-quality recaptions rather than the minimalist design.

    Authors: We acknowledge that a controlled ablation holding the Recap-DataComp-1B dataset fixed while comparing the generative LM objective to a contrastive baseline would more rigorously isolate the contribution of the objective. In the revised manuscript, we will add this experiment by training a contrastive baseline on the identical dataset and reporting comparative results on the multimodal benchmarks. This will help substantiate that the performance gains are attributable to the generative approach rather than solely to data quality. revision: yes

  2. Referee: [Experiments section] The abstract states competitive results on diverse benchmarks but supplies no details on experimental setup, exact baselines and their data volumes, error bars, or data exclusion criteria. This absence prevents verification of the claim that GenLIP matches or surpasses strong baselines with substantially less pretraining data.

    Authors: We agree that the current manuscript lacks sufficient experimental details for full verification. The revised version will expand the Experiments section to include a complete description of the setup, exact baselines with their pretraining data volumes, error bars from repeated runs where feasible, and any data exclusion criteria. We will also ensure the abstract references these details for clarity. revision: yes

Circularity Check

0 steps flagged

No circularity detected: this is an empirical generative pretraining framework with no load-bearing derivations or self-referential reductions.

Full rationale

The paper describes GenLIP as a direct application of standard language modeling to train a ViT to predict language tokens from visual tokens, without contrastive losses or extra decoders. No equations, uniqueness theorems, or parameter-fitting steps are presented that reduce claimed performance or alignment properties to inputs by construction. Results are reported from training on Recap-DataComp-1B and continued pretraining, with comparisons to baselines. The approach relies on experimental validation rather than tautological definitions or self-citation chains that would force the outcome. This is a standard empirical ML paper with self-contained claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions about transformer architectures and language modeling objectives; no free parameters, new entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: A single transformer can jointly model visual and textual tokens under a language modeling objective.
    Invoked in the description of the single-transformer design for GenLIP.

pith-pipeline@v0.9.0 · 5542 in / 1204 out tokens · 24259 ms · 2026-05-09T18:53:36.754036+00:00 · methodology


Reference graph

Works this paper leans on

82 extracted references · 27 canonical work pages · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advancesin neural information processing systems, 35:23716–23736, 2022

  4. [4]

    Multi-label cluster discrimination for visual representation learning

    Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, and Jiankang Deng. Multi-label cluster discrimination for visual representation learning. InECCV, 2024

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  7. [7]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  8. [8]

    Vl-beit: Generative vision-language pretraining.arXiv preprint arXiv:2206.01127, 2022

    Hangbo Bao, Wenhui Wang, Li Dong, and Furu Wei. Vl-beit: Generative vision-language pretraining.arXiv preprint arXiv:2206.01127, 2022

  9. [9]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  10. [10]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  11. [11]

    A single transformer for scalable vision-language modeling

    Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling. Transactions on Machine Learning Research, 2024

  12. [12]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  13. [14]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023

  14. [15]

    Meta clip 2: A worldwide scaling recipe

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James R Glass, LIFEI HUANG, Jason E Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  15. [16]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023

  16. [17]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  17. [18]

    Unveiling encoder-free vision-language models

    Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. Advances in Neural Information Processing Systems, 37:52545–52567, 2024

  18. [19]

    From pixels to words–towards native vision-language primitives at scale

    Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale.arXiv preprint arXiv:2510.14979, 2025

  19. [20]

    Evev2: Improved baselines for encoder-free vision-language models

    Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025

  20. [21]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In9th International Conference on Learning Representations, ICLR 2021. O...

  21. [22]

    Improving clip training with language rewrites

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. Advancesin Neural Information Processing Systems, 36:35544–35575, 2023

  22. [23]

    Multimodal autoregressive pre-training of large vision encoders

    Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor G Turrisi da Costa, Louis Béthune, Zhe Gan, et al. Multimodal autoregressive pre-training of large vision encoders. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9641–9654, 2025

  23. [24]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  24. [25]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advancesin Neural Information Processing Systems, 36:27092–27112, 2023

  25. [26]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  26. [27]

    Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data

    Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. CoRR, 2024

  27. [28]

    Classification done right for vision-language pre-training

    Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, and Haoqi Fan. Classification done right for vision-language pre-training. Advancesin Neural Information Processing Systems, 37:96483–96504, 2024

  28. [29]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  29. [30]

    Unifying vision-language repre- sentation space with single-tower transformer

    Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, and Nojun Kwak. Unifying vision-language repre- sentation space with single-tower transformer. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 980–988, 2023

  30. [31]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  31. [32]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251. Springer, 2016

  32. [33]

    Veclip: Improving clip training via visual-enriched captions

    Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, et al. Veclip: Improving clip training via visual-enriched captions. InEuropean Conference on Computer Vision, pages 111–127. Springer, 2024

  33. [34]

    The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

    Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20758–20769, October 2025

  34. [35]

    Llava-onevision: Easy visual task transfer.CoRR, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. CoRR, 2024

  35. [36]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.CoRR, 2024

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.CoRR, 2024

  36. [37]

    Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

  37. [38]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  38. [39]

    Denseworld-1m: Towards detailed dense grounded caption in the real world.arXiv preprint arXiv:2506.24102, 2025

    Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, et al. Denseworld-1m: Towards detailed dense grounded caption in the real world.arXiv preprint arXiv:2506.24102, 2025

  39. [40]

    What if we recaption billions of web images with llama-3? InInternational Conference on Machine Learning

    Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, et al. What if we recaption billions of web images with llama-3? InInternational Conference on Machine Learning. PMLR, 2024

  40. [41]

    Openvision: A fully-open, cost-effective family of advanced vision encoders for multimodal learning

    Xianhang Li, Yanqing Liu, Haoqin Tu, and Cihang Xie. Openvision: A fully-open, cost-effective family of advanced vision encoders for multimodal learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3977–3987, 2025

  41. [42]

    Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556, 2024

    Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Lingyu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556, 2024

  42. [43]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  43. [44]

    LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next, 1(8), 2024

  44. [45]

    Clips: An enhanced clip framework for learning with synthetic captions.arXiv preprint arXiv:2411.16828, 2024

    Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, and Cihang Xie. Clips: An enhanced clip framework for learning with synthetic captions.arXiv preprint arXiv:2411.16828, 2024

  45. [46]

    Openvision 2: A family of generative pretrained visual encoders for multimodal learning.arXiv preprint arXiv:2509.01644, 2025

    Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, and Cihang Xie. Openvision 2: A family of generative pretrained visual encoders for multimodal learning.arXiv preprint arXiv:2509.01644, 2025

  46. [47]

    Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

  47. [48]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  48. [49]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  49. [50]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022

  50. [51]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  51. [52]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

  52. [53]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

  53. [54]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708, 2025

  54. [55]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  55. [56]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  56. [57]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  57. [58]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. InEuropean conference on computer vision, pages 742–758. Springer, 2020

  58. [59]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  59. [60]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

  60. [61]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  61. [62]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  62. [63]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  63. [64]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  64. [65]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

  65. [66]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  66. [67]

    Image captioners are scalable vision learners too.Advances in Neural Information Processing Systems, 36:46830–46855, 2023

    Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, and Lucas Beyer. Image captioners are scalable vision learners too.Advances in Neural Information Processing Systems, 36:46830–46855, 2023

  67. [68]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprintarXiv:2502.14786, 2025

  68. [69]

    GIT: A Generative Image-to-Text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100, 2022

  69. [70]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  70. [71]

    SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision.arXiv preprint arXiv:2108.10904, 2021

  71. [72]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

  72. [73]

    Region-based cluster discrimination for visual representation learning

    Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Miles Roy, Elezi Ismail, and Jiankang Deng. Region-based cluster discrimination for visual representation learning. In ICCV, 2025

  73. [74]

    Demysti- fying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data.arXiv preprint arXiv:2309.16671, 2023

  74. [75]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  75. [76]

    Alip: Adaptive language-image pre-training with synthetic caption

    Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, and Tongliang Liu. Alip: Adaptive language-image pre-training with synthetic caption. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2922–2931, 2023

  76. [77]

    Coca: Contrastive captioners are image- text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917, 2022

  77. [78]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  78. [79]

    Glipv2: unifying localization and vl understanding

    Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: unifying localization and vl understanding. In36th Conf. Neural Inf. Process. Syst. NeurIPS, 2022

  79. [80]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025

  80. [81]

    Dreamlip: Language-image pre-training with long captions

    Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen. Dreamlip: Language-image pre-training with long captions. InEuropean Conference on Computer Vision, pages 73–90. Springer, 2024

Showing first 80 references.