pith. machine review for the scientific record.

arxiv: 2605.00809 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

Let ViT Speak: Generative Language-Image Pre-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords Generative pretraining · Vision Transformer · Multimodal large language models · Language-image alignment · Autoregressive modeling · ViT encoder · OCR and chart understanding

The pith

A ViT can learn to generate language tokens from visual tokens using only a language modeling objective, aligning it with autoregressive LLMs without contrastive batches or a separate text decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GenLIP pretrains Vision Transformers by feeding visual tokens into the model and training it to predict the corresponding language tokens under a standard next-token language modeling loss. This produces a vision encoder that works directly inside multimodal LLMs without the usual contrastive pretraining stage or an auxiliary text decoder. The resulting model reaches or exceeds strong baselines on multimodal benchmarks after training on 8 billion samples from Recap-DataComp-1B, and a second stage of continued pretraining on native-resolution images lifts performance further on tasks that require fine visual detail, such as OCR and chart reading.
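A minimal sketch of how that objective can be wired up in PyTorch, assuming flattened image patches, a toy single-transformer stack, and a fully causal attention mask (all names, sizes, and those choices are illustrative assumptions, not the paper's actual architecture): visual tokens and caption tokens share one sequence, and the next-token cross-entropy is computed only on the caption positions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToySingleTransformerVLM(nn.Module):
        """One transformer over a joint visual + text token sequence (illustrative)."""
        def __init__(self, vocab_size=32000, dim=512, patch_dim=3 * 16 * 16, layers=4, heads=8):
            super().__init__()
            self.patch_embed = nn.Linear(patch_dim, dim)     # flattened patches -> visual tokens
            self.text_embed = nn.Embedding(vocab_size, dim)  # caption ids -> text tokens
            block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.backbone = nn.TransformerEncoder(block, layers)
            self.lm_head = nn.Linear(dim, vocab_size)

        def forward(self, patches, text_ids):
            # patches: (B, N_img, patch_dim); text_ids: (B, N_txt)
            seq = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
            n = seq.size(1)
            causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=seq.device), 1)
            return self.lm_head(self.backbone(seq, mask=causal))

    def generative_pretraining_loss(model, patches, text_ids):
        # Next-token cross-entropy on caption tokens only, conditioned on the image:
        # the last visual position predicts the first caption token, and so on.
        logits = model(patches, text_ids)
        n_img = patches.size(1)
        pred = logits[:, n_img - 1 : n_img - 1 + text_ids.size(1)]
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), text_ids.reshape(-1))

How positional information is injected, and whether the visual tokens attend bidirectionally among themselves or causally as here, are details the abstract does not pin down; the sketch only illustrates that one loss and one stack can carry both modalities.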

Core claim

GenLIP trains a single ViT to jointly process visual and textual tokens by directly predicting language tokens from visual tokens with a language modeling objective. This minimalist design replaces contrastive loss and extra decoders while still delivering competitive or better results on diverse MLLM benchmarks when trained on 8B samples, with additional gains on detail-sensitive tasks after multi-resolution continued pretraining.

What carries the argument

The generative language modeling objective that trains the ViT to output language tokens conditioned only on visual tokens inside one shared transformer.

If this is right

  • A single transformer suffices to model both modalities together.
  • The approach scales with both data volume and model size.
  • It matches or exceeds strong baselines on multimodal tasks despite using substantially less pretraining data than competitors.
  • Continued pretraining at native aspect ratios and multiple resolutions improves results on OCR and chart-understanding tasks.
  • The encoder can be used directly inside autoregressive MLLMs without extra alignment stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Native-aspect-ratio training may preserve spatial relationships that standard square resizing discards, explaining the OCR gains (a preprocessing sketch follows this list).
  • Removing contrastive batch construction could simplify large-scale data pipelines and reduce memory requirements during pretraining.
  • The same generative objective might extend to video or other temporally ordered visual inputs with minimal architectural change.
  • Because the vision encoder is already trained to emit language-like tokens, downstream instruction tuning of the full MLLM could converge faster.
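On the native-aspect-ratio point above, here is a sketch of one plausible preprocessing step; the patch size, the token cap, and the PIL-based resizing are assumptions for illustration, not the paper's pipeline. Each image is resized to the nearest patch-size multiples at its original aspect ratio, so the number of visual tokens varies per image instead of every image being squashed to a fixed square.

    from PIL import Image

    PATCH = 16  # assumed ViT patch size

    def native_resolution_resize(img: Image.Image, max_patches: int = 1024):
        """Resize to patch-size multiples while keeping the native aspect ratio.

        A tall document page stays tall and simply yields more (or fewer) patch
        tokens, capped roughly at max_patches, rather than being squashed square.
        """
        w, h = img.size
        scale = min(1.0, (max_patches * PATCH * PATCH / (w * h)) ** 0.5)
        new_w = max(PATCH, round(w * scale / PATCH) * PATCH)
        new_h = max(PATCH, round(h * scale / PATCH) * PATCH)
        resized = img.resize((new_w, new_h))
        return resized, (new_w // PATCH) * (new_h // PATCH)  # image, visual-token count

Because rows of text and chart gridlines are neither stretched nor compressed, character shapes keep their proportions, which is exactly the kind of detail the OCR and chart benchmarks reward.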

Load-bearing premise

That a pure language modeling loss on visual-to-text token prediction will produce vision features aligned well enough for autoregressive LLMs without any contrastive signal or separate text tower.
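To make the premise concrete, here is roughly what the two signals look like side by side; this is a sketch under standard conventions, the InfoNCE form is the usual CLIP-style loss the paper says it drops, and neither snippet is the authors' code. The contrastive loss needs every other caption in the batch as a negative, while the generative loss is a per-sample next-token cross-entropy that does not depend on batch composition.

    import torch
    import torch.nn.functional as F

    def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # Symmetric InfoNCE over pooled embeddings: every other pair in the
        # batch acts as a negative, so the signal scales with batch size.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    def generative_loss(text_logits, text_ids):
        # Per-sample next-token loss on image-conditioned caption predictions
        # (as in the earlier sketch); no in-batch negatives are required.
        return F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)), text_ids.reshape(-1))

The practical difference the premise rides on is visible here: the contrastive path needs careful batch construction and a large effective batch; the generative path does not; the open question is whether the features it produces align as well.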

What would settle it

Plugging the GenLIP-trained ViT and a contrastively trained ViT into identical MLLM architectures and finding that the contrastive version scores substantially higher on the same suite of multimodal benchmarks would falsify the performance claim.

Original abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GenLIP, a minimalist generative pre-training framework for Vision Transformers (ViTs) aimed at multimodal large language models (MLLMs). It trains a ViT to predict language tokens directly from visual tokens using only a standard language modeling objective, without contrastive batch construction or an additional text decoder. The approach is claimed to offer simplicity (single transformer for visual and textual tokens), scalability with data and model size, and competitive or superior performance on diverse multimodal benchmarks. Specifically, when trained on 8B samples from Recap-DataComp-1B it matches or surpasses strong baselines despite using less data; continued pretraining on multi-resolution images at native aspect ratios further improves results on detail-sensitive tasks such as OCR and chart understanding.

Significance. If the results hold under rigorous controls, the work would be significant for demonstrating that a pure generative language-modeling objective can produce vision encoders that align effectively with autoregressive LLMs, thereby simplifying pretraining pipelines that currently rely on contrastive losses. The reported ability to achieve strong performance with substantially less data and to improve on fine-grained tasks via multi-resolution continued pretraining would be a practical contribution to MLLM vision-encoder design.

major comments (2)
  1. [Abstract and Experiments section] The central claim that the generative LM objective (rather than data quality) produces effective alignment is load-bearing, yet the manuscript provides no controlled ablation that holds the Recap-DataComp-1B dataset fixed while swapping the objective for a contrastive baseline. Without this comparison, gains over prior methods could be driven by the high-quality recaptions rather than the minimalist design.
  2. [Experiments section] The abstract states competitive results on diverse benchmarks but supplies no details on experimental setup, exact baselines and their data volumes, error bars, or data exclusion criteria. This absence prevents verification of the claim that GenLIP matches or surpasses strong baselines with substantially less pretraining data.
minor comments (2)
  1. [Abstract] The phrase '8B samples from Recap-DataComp-1B' would benefit from explicit clarification of whether this constitutes the full dataset or a curated subset.
  2. [Introduction] The manuscript would be strengthened by citing prior generative vision-language pretraining works that also avoid contrastive losses, to better situate the novelty of the single-transformer design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that additional controls and details are needed to strengthen the claims and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The central claim that the generative LM objective (rather than data quality) produces effective alignment is load-bearing, yet the manuscript provides no controlled ablation that holds the Recap-DataComp-1B dataset fixed while swapping the objective for a contrastive baseline. Without this comparison, gains over prior methods could be driven by the high-quality recaptions rather than the minimalist design.

    Authors: We acknowledge that a controlled ablation holding the Recap-DataComp-1B dataset fixed while comparing the generative LM objective to a contrastive baseline would more rigorously isolate the contribution of the objective. In the revised manuscript, we will add this experiment by training a contrastive baseline on the identical dataset and reporting comparative results on the multimodal benchmarks. This will help substantiate that the performance gains are attributable to the generative approach rather than solely to data quality. revision: yes

  2. Referee: [Experiments section] The abstract states competitive results on diverse benchmarks but supplies no details on experimental setup, exact baselines and their data volumes, error bars, or data exclusion criteria. This absence prevents verification of the claim that GenLIP matches or surpasses strong baselines with substantially less pretraining data.

    Authors: We agree that the current manuscript lacks sufficient experimental details for full verification. The revised version will expand the Experiments section to include a complete description of the setup, exact baselines with their pretraining data volumes, error bars from repeated runs where feasible, and any data exclusion criteria. We will also ensure the abstract references these details for clarity. revision: yes

Circularity Check

0 steps flagged

No circularity detected: this is an empirical generative pretraining framework with no load-bearing derivations or self-referential reductions.

Full rationale

The paper describes GenLIP as a direct application of standard language modeling to train a ViT to predict language tokens from visual tokens, without contrastive losses or extra decoders. No equations, uniqueness theorems, or parameter-fitting steps are presented that reduce claimed performance or alignment properties to inputs by construction. Results are reported from training on Recap-DataComp-1B and continued pretraining, with comparisons to baselines. The approach relies on experimental validation rather than tautological definitions or self-citation chains that would force the outcome. This is a standard empirical ML paper with self-contained claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard assumptions about transformer architectures and language modeling objectives; no free parameters, new entities, or ad-hoc axioms are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: A single transformer can jointly model visual and textual tokens under a language modeling objective.
    Invoked in the description of the single-transformer design for GenLIP.

pith-pipeline@v0.9.0 · 5542 in / 1204 out tokens · 24259 ms · 2026-05-09T18:53:36.754036+00:00 · methodology


Reference graph

Works this paper leans on

82 extracted references · 27 canonical work pages · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advancesin neural information processing systems, 35:23716–23736, 2022

  4. [4]

    Multi-label cluster discrimination for visual representation learning

    Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, and Jiankang Deng. Multi-label cluster discrimination for visual representation learning. InECCV, 2024

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

  7. [7]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  8. [8]

    Vl-beit: Generative vision-language pretraining.arXiv preprint arXiv:2206.01127, 2022

    Hangbo Bao, Wenhui Wang, Li Dong, and Furu Wei. Vl-beit: Generative vision-language pretraining.arXiv preprint arXiv:2206.01127, 2022

  9. [9]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  10. [10]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  11. [11]

    A single transformer for scalable vision-language modeling

    Yangyi Chen, Xingyao Wang, Hao Peng, and Heng Ji. A single transformer for scalable vision-language modeling. Transactions on Machine Learning Research, 2024

  12. [12]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  13. [14]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023

  14. [15]

    Meta clip 2: A worldwide scaling recipe

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James R Glass, LIFEI HUANG, Jason E Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  15. [16]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023

  16. [17]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  17. [18]

    Unveiling encoder-free vision-language models

    Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, and Xinlong Wang. Unveiling encoder-free vision-language models. Advances in Neural Information Processing Systems, 37:52545–52567, 2024

  18. [19]

    From pixels to words–towards native vision-language primitives at scale

    Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, and Ziwei Liu. From pixels to words–towards native vision-language primitives at scale.arXiv preprint arXiv:2510.14979, 2025

  19. [20]

    Evev2: Improved baselines for encoder-free vision-language models

    Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025

  20. [21]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In9th International Conference on Learning Representations, ICLR 2021. O...

  21. [22]

    Improving clip training with language rewrites

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. Advancesin Neural Information Processing Systems, 36:35544–35575, 2023

  22. [23]

    Multimodal autoregressive pre-training of large vision encoders

    Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor G Turrisi da Costa, Louis Béthune, Zhe Gan, et al. Multimodal autoregressive pre-training of large vision encoders. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9641–9654, 2025

  23. [24]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  24. [25]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advancesin Neural Information Processing Systems, 36:27092–27112, 2023

  25. [26]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  26. [27]

    Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data

    Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data. CoRR, 2024

  27. [28]

    Classification done right for vision-language pre-training

    Zilong Huang, Qinghao Ye, Bingyi Kang, Jiashi Feng, and Haoqi Fan. Classification done right for vision-language pre-training. Advancesin Neural Information Processing Systems, 37:96483–96504, 2024

  28. [29]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  29. [30]

    Unifying vision-language repre- sentation space with single-tower transformer

    Jiho Jang, Chaerin Kong, Donghyeon Jeon, Seonhoon Kim, and Nojun Kwak. Unifying vision-language repre- sentation space with single-tower transformer. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 980–988, 2023

  30. [31]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  31. [32]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251. Springer, 2016

  32. [33]

    Veclip: Improving clip training via visual-enriched captions

    Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, et al. Veclip: Improving clip training via visual-enriched captions. InEuropean Conference on Computer Vision, pages 111–127. Springer, 2024

  33. [34]

    The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

    Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20758–20769, October 2025

  34. [35]

    Llava-onevision: Easy visual task transfer.CoRR, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. CoRR, 2024

  35. [36]

    Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.CoRR, 2024

    Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, and Ying Shan. Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension.CoRR, 2024

  36. [37]

    Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation.Advances in neural information processing systems, 34:9694–9705, 2021

  37. [38]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  38. [39]

    Denseworld-1m: Towards detailed dense grounded caption in the real world.arXiv preprint arXiv:2506.24102, 2025

    Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, et al. Denseworld-1m: Towards detailed dense grounded caption in the real world.arXiv preprint arXiv:2506.24102, 2025

  39. [40]

    What if we recaption billions of web images with llama-3? InInternational Conference on Machine Learning

    Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, et al. What if we recaption billions of web images with llama-3? InInternational Conference on Machine Learning. PMLR, 2024

  40. [41]

    Openvision: A fully-open, cost-effective family of advanced vision encoders for multimodal learning

    Xianhang Li, Yanqing Liu, Haoqin Tu, and Cihang Xie. Openvision: A fully-open, cost-effective family of advanced vision encoders for multimodal learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3977–3987, 2025

  41. [42]

    Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556, 2024

    Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Lingyu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556, 2024

  42. [43]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  43. [44]

    LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next, 1(8), 2024

  44. [45]

    Clips: An enhanced clip framework for learning with synthetic captions.arXiv preprint arXiv:2411.16828, 2024

    Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, and Cihang Xie. Clips: An enhanced clip framework for learning with synthetic captions.arXiv preprint arXiv:2411.16828, 2024

  45. [46]

    Openvision 2: A family of generative pretrained visual encoders for multimodal learning.arXiv preprint arXiv:2509.01644, 2025

    Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, and Cihang Xie. Openvision 2: A family of generative pretrained visual encoders for multimodal learning.arXiv preprint arXiv:2509.01644, 2025

  46. [47]

    Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

  47. [48]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

  48. [49]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  49. [50]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022

  50. [51]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  51. [52]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

  52. [53]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

  53. [54]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708, 2025

  54. [55]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  55. [56]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  56. [57]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  57. [58]

    Textcaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. InEuropean conference on computer vision, pages 742–758. Springer, 2020

  58. [59]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  59. [60]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024

  60. [61]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  61. [62]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  62. [63]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  63. [64]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  64. [65]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri Iyer, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advancesin Neural Information Processing Systems, 37:87310–87356, 2024

  65. [66]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  66. [67]

    Image captioners are scalable vision learners too.Advances in Neural Information Processing Systems, 36:46830–46855, 2023

    Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, and Lucas Beyer. Image captioners are scalable vision learners too.Advances in Neural Information Processing Systems, 36:46830–46855, 2023

  67. [68]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprintarXiv:2502.14786, 2025

  68. [69]

    GIT: A Generative Image-to-Text Transformer for Vision and Language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language.arXiv preprint arXiv:2205.14100, 2022

  69. [70]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  70. [71]

    SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision.arXiv preprint arXiv:2108.10904, 2021

  71. [72]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

  72. [73]

    Region-based cluster discrimination for visual representation learning

    Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Miles Roy, Elezi Ismail, and Jiankang Deng. Region-based cluster discrimination for visual representation learning. In ICCV, 2025

  73. [74]

    Demysti- fying clip data

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data.arXiv preprint arXiv:2309.16671, 2023

  74. [75]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  75. [76]

    Alip: Adaptive language-image pre-training with synthetic caption

    Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, and Tongliang Liu. Alip: Adaptive language-image pre-training with synthetic caption. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2922–2931, 2023

  76. [77]

    Coca: Contrastive captioners are image- text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models.arXiv preprint arXiv:2205.01917, 2022

  77. [78]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  78. [79]

    Glipv2: unifying localization and vl understanding

    Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: unifying localization and vl understanding. In36th Conf. Neural Inf. Process. Syst. NeurIPS, 2022

  79. [80]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025

  80. [81]

    Dreamlip: Language-image pre-training with long captions

    Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen. Dreamlip: Language-image pre-training with long captions. InEuropean Conference on Computer Vision, pages 73–90. Springer, 2024

Showing first 80 references.