GenLIP pretrains ViTs to generate language tokens from images via LM objective without contrastive batches or extra decoders, matching baselines on less data and improving on OCR after multi-resolution continued pretraining.
Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2representative citing papers
citing papers explorer
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from images via LM objective without contrastive batches or extra decoders, matching baselines on less data and improving on OCR after multi-resolution continued pretraining.
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models