InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Pith reviewed 2026-05-13 22:42 UTC · model grok-4.3
The pith
InternVL scales a vision foundation model to 6 billion parameters and progressively aligns it with an LLM on web-scale image-text data to reach state-of-the-art performance on 32 visual-linguistic benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternVL scales the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources. The paper claims this allows broad application and state-of-the-art performance on 32 generic visual-linguistic benchmarks, spanning image-level and pixel-level recognition, zero-shot image and video classification, zero-shot image-text and video-text retrieval, and multi-modal dialogue systems.
What carries the argument
The InternVL architecture, which enlarges the vision encoder to 6 billion parameters and applies progressive alignment with an LLM on diverse web-scale image-text pairs to unify perception and language capabilities.
If this is right
- The model supports both image-level and pixel-level visual recognition at competitive accuracy.
- It enables zero-shot classification and retrieval for both still images and video.
- Linking InternVL with an LLM produces capable multi-modal dialogue systems.
- The 6B vision component serves as a practical substitute for the larger ViT-22B.
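To make the zero-shot classification consequence concrete, here is a minimal sketch. It assumes the common contrastive setup in which an image and a set of label prompts are embedded into one shared space and the label with the highest cosine similarity wins. The text above does not say InternVL works exactly this way, and the embeddings below are hand-made toy values rather than model outputs. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, label_embeddings):
    """Return (best_label, scores), where scores maps each label to its cosine similarity."""
    img = l2_normalize(image_embedding)
    scores = {}
    for label, emb in label_embeddings.items():
        txt = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumption: in practice these would come from the
# vision encoder and the text encoder of an aligned model).
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)
```

No classifier is trained here: adding a new class only requires embedding a new label prompt, which is what makes the evaluation zero-shot.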
Where Pith is reading between the lines
- The same scaling-plus-alignment recipe could be tested on other modalities such as audio or 3D scenes to check whether the gains transfer.
- Wider deployment may reveal whether performance holds when the input distribution shifts away from the original web data sources.
- Integration into existing LLM tool-use pipelines could let single models handle mixed visual and textual instructions with fewer separate components.
Load-bearing premise
That scaling the vision encoder to 6 billion parameters and aligning it progressively on web-scale image-text data will produce generalizable state-of-the-art results across 32 diverse benchmarks without overfitting or source-specific biases.
What would settle it
Finding a new visual-linguistic benchmark, drawn from data sources outside the training mixture, on which InternVL falls below the performance of prior smaller models.
Original abstract
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternVL, a vision-language foundation model that scales the vision encoder to 6 billion parameters and progressively aligns it with an LLM using web-scale image-text data from various sources. It claims state-of-the-art performance on 32 generic visual-linguistic benchmarks, including image/pixel-level recognition, zero-shot image/video classification and retrieval, and multimodal dialogue systems, while positioning the model as a viable alternative to ViT-22B.
Significance. If the performance claims hold after addressing data-contamination risks, the work would represent a meaningful step in scaling vision foundation models to 6B parameters and aligning them for broad multimodal tasks. The public release of code and models is a clear strength that supports reproducibility and community follow-up.
major comments (2)
- [§3.2 and §4] The progressive alignment procedure on heterogeneous web-scale corpora is described, but no overlap statistics, decontamination steps, or membership inference results are reported between the collected training pairs and the 32 evaluation benchmarks (ImageNet, COCO, VQAv2, ActivityNet, etc.). Without this, the zero-shot and few-shot SOTA numbers cannot be reliably interpreted as generalization rather than leakage.
- [§5, Experiments] The manuscript asserts SOTA on 32 benchmarks after scaling and alignment, yet provides neither ablation tables isolating the contribution of the 6B vision encoder versus the alignment stages, nor error bars, nor full baseline comparisons with contemporaneous models. These omissions make the central scaling claim difficult to verify from the reported results.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise table summarizing the 32 benchmarks and the precise metrics on which SOTA is claimed.
- [§3] Notation for the vision encoder size (6B) and the LLM component should be introduced earlier and used consistently throughout the method sections.
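The overlap statistics requested in the first major comment could be computed along the following lines. This is a text-side sketch only, assuming captions are available as plain strings and that sharing any word-level n-gram counts as a hit. The function names, the choice of n = 3, and the toy captions are all illustrative assumptions, not the paper's procedure. Image-side duplicates would need a separate check that is not shown.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in a caption."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_captions, benchmark_captions, n=3):
    """Fraction of benchmark captions that share at least one n-gram with the training set."""
    train_grams = set()
    for caption in train_captions:
        train_grams |= word_ngrams(caption, n)
    # A benchmark caption is flagged when its n-gram set intersects the training n-grams.
    flagged = [c for c in benchmark_captions if word_ngrams(c, n) & train_grams]
    rate = len(flagged) / len(benchmark_captions) if benchmark_captions else 0.0
    return rate, flagged


# Toy data (hypothetical captions, not drawn from any real corpus).
train = ["A brown dog runs across the park", "two people riding bikes"]
bench = ["a brown dog runs on the beach", "a red car parked outside"]

rate, flagged = contamination_rate(train, bench, n=3)
print(rate, flagged)
```

A small n flags many incidental matches, while a large n misses paraphrased duplicates, so any reported rate is only meaningful alongside the n that produced it.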
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate additional analyses where feasible.
- Referee: [§3.2 and §4] The progressive alignment procedure on heterogeneous web-scale corpora is described, but no overlap statistics, decontamination steps, or membership inference results are reported between the collected training pairs and the 32 evaluation benchmarks (ImageNet, COCO, VQAv2, ActivityNet, etc.). Without this, the zero-shot and few-shot SOTA numbers cannot be reliably interpreted as generalization rather than leakage.
  Authors: We agree that explicit decontamination reporting is necessary to substantiate the zero-shot claims. In the revised version we will add a dedicated subsection under §3.2 that reports (i) n-gram overlap statistics between the web-scale training pairs and each of the 32 evaluation benchmarks, (ii) the exact decontamination filters applied (e.g., exact URL and caption matching), and (iii) membership-inference results obtained via a simple loss-threshold attack on a held-out subset of the benchmarks. These additions will allow readers to assess the degree of leakage directly.
  Revision: yes
- Referee: [§5, Experiments] The manuscript asserts SOTA on 32 benchmarks after scaling and alignment, yet provides neither ablation tables isolating the contribution of the 6B vision encoder versus the alignment stages, nor error bars, nor full baseline comparisons with contemporaneous models. These omissions make the central scaling claim difficult to verify from the reported results.
  Authors: We acknowledge that the current experimental section would benefit from more granular ablations. We will insert a new ablation table in §5 that isolates (a) the effect of scaling the vision encoder from 1B to 6B parameters while keeping the alignment procedure fixed, and (b) the incremental gains from each stage of the progressive alignment. Where multiple random seeds were run, we will report mean ± standard deviation. We will also expand the baseline table to include additional contemporaneous models (e.g., recent 2023–2024 vision-language models) that were omitted for space reasons in the original submission.
  Revision: yes
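The rebuttal mentions a simple loss-threshold membership-inference attack without giving details. One common form, sketched here under stated assumptions, calibrates a threshold on losses from examples known not to be in the training data and then flags benchmark examples whose loss falls below it. The function names, the 5% false-positive target, and the loss values are all hypothetical. Per-example losses are assumed to have been computed elsewhere.

```python
def calibrate_threshold(nonmember_losses, false_positive_rate=0.05):
    """Pick a threshold so about `false_positive_rate` of known non-members fall strictly below it."""
    if not nonmember_losses:
        raise ValueError("need at least one non-member loss to calibrate")
    ordered = sorted(nonmember_losses)
    k = int(false_positive_rate * len(ordered))
    return ordered[k]


def flag_members(candidate_losses, threshold):
    """Flag an example as a likely training member when its loss is below the threshold."""
    return [loss < threshold for loss in candidate_losses]


# Toy losses (assumption: lower loss means the model fits the example better).
nonmember_losses = [1.5 + 0.25 * i for i in range(20)]  # 1.5, 1.75, ..., 6.25
benchmark_losses = [0.4, 2.5, 1.6, 3.0]

threshold = calibrate_threshold(nonmember_losses, false_positive_rate=0.05)
flags = flag_members(benchmark_losses, threshold)
suspected_fraction = sum(flags) / len(flags)
print(threshold, flags, suspected_fraction)
```

A suspected fraction well above the calibrated false-positive rate would suggest leakage. A low fraction does not rule it out, since threshold attacks of this kind are known to be weak detectors.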
Circularity Check
No significant circularity in derivation chain
The paper describes an empirical pipeline: scaling a vision encoder to 6B parameters, followed by progressive alignment stages on web-scale image-text corpora, then direct evaluation on 32 external benchmarks (ImageNet, COCO, VQAv2, etc.). No load-bearing step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop; architectural choices and training objectives are stated explicitly and the reported SOTA numbers rest on independent test sets rather than internal re-derivation of the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Transformer-based vision and language models follow established scaling laws when trained on web-scale image-text pairs.
- Domain assumption: Standard benchmark suites are sufficient to demonstrate general visual-linguistic capabilities.
Forward citations
Cited by 23 Pith papers
- S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
  S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
- CATS: Curvature Aware Temporal Selection for efficient long video understanding
  CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
  Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
- Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
  Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
- Anisotropic Modality Align
  Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
- Agentic AI for Remote Sensing: Technical Challenges and Research Directions
  Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...
- If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
  LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.
- CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
  CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
- AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
  AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.
- Long Context Transfer from Language to Vision
  Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
- Are We on the Right Way for Evaluating Large Vision-Language Models?
  Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
- Agentic AI for Remote Sensing: Technical Challenges and Research Directions
  Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.
- From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
  VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
- Are Face Embeddings Compatible Across Deep Neural Network Models?
  Simple affine transformations align face embeddings across different DNN models, substantially improving cross-model identification and verification performance.
- Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
  Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
  DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
  Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.
- When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
  Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- A Survey on Hallucination in Large Vision-Language Models
  This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
Reference graph
Works this paper leans on
-
[1]
Towards zero- shot cross-lingual image retrieval
Pranav Aggarwal and Ajinkya Kale. Towards zero- shot cross-lingual image retrieval. arXiv preprint arXiv:2012.05107, 2020. 8, 10, 16
-
[2]
Nocaps: Novel object cap- tioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object cap- tioning at scale. In ICCV, pages 8948–8957, 2019. 8, 17, 18
work page 2019
-
[3]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022. 1, 3, 8
work page 2022
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Day- iheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Baichuan 2: Open large-scale language models
Baichuan. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023. 3
-
[7]
Beit: Bert pre- training of image transformers
Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre- training of image transformers. In ICLR, 2022. 6, 11, 12
work page 2022
-
[8]
Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 32, 2019. 7, 15
work page 2019
-
[9]
Bird- snap: Large-scale fine-grained visual categorization of birds
Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Bird- snap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2011–2018, 2014. 11, 16
work page 2011
-
[10]
Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020
Lucas Beyer, Olivier J H ´enaff, Alexander Kolesnikov, Xi- aohua Zhai, and A ¨aron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020. 6, 15
-
[11]
Contrastive language-image pre-training for the italian language
Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Sil- via Terragni, Gabriele Sarti, and Sri Lakshmi. Contrastive language-image pre-training for the italian language. arXiv preprint arXiv:2108.08688, 2021. 7
-
[12]
Scene text visual question answering
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019. 5, 17
work page 2019
-
[13]
Food-101–mining discriminative components with random forests
Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461, 2014. 11, 16
work page 2014
-
[14]
Coyo-700m: Image-text pair dataset, 2022
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset, 2022. 5, 13, 15
work page 2022
-
[15]
Yuxuan Cai, Yizhuang Zhou, Qi Han, Jianjian Sun, Xiang- wen Kong, Jun Li, and Xiangyu Zhang. Reversible column networks. arXiv preprint arXiv:2212.11696, 2022. 3
-
[16]
Cross-lingual and multilingual clip
Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Mag- nus Sahlgren. Cross-lingual and multilingual clip. In Pro- ceedings of the Thirteenth Language Resources and Evalu- ation Conference, pages 6848–6854, 2022. 7, 8, 10
work page 2022
-
[17]
Quo vadis, action recognition? a new model and the kinetics dataset
Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017. 7, 8, 13, 16
work page 2017
-
[18]
Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-
- [19]
-
[20]
A short note on the kinetics-700 human action dataset
Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zis- serman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 7, 8, 13, 16
-
[21]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021. 5, 13, 15
work page 2021
-
[22]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 1, 3, 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server. arXiv preprint arXiv:1504.00325, 2015. 5, 7, 8, 16, 17, 18
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[24]
Pali: A jointly-scaled multilingual language-image model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. In ICLR, 2022. 1, 3, 4
work page 2022
-
[25]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 8
-
[26]
Vision transformer adapter for dense predictions
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2022. 3
work page 2022
-
[27]
Altclip: Altering the lan- guage encoder in clip for extended language capabilities
Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Altclip: Altering the lan- guage encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022. 7, 8, 10
-
[28]
Remote sens- ing image scene classification: Benchmark and state of the art
Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sens- ing image scene classification: Benchmark and state of the art. Proceedings of the IEEE , 105(10):1865–1883, 2017. 11 19
work page 2017
-
[29]
Describing tex- tures in the wild
Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing tex- tures in the wild. In CVPR, pages 3606–3613, 2014. 11, 16
work page 2014
-
[30]
Simple and effective multi-paragraph reading comprehension
Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723, 2017. 5, 17
-
[31]
An analysis of single-layer networks in unsupervised feature learning
Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTAT, pages 215–223, 2011. 11, 16
work page 2011
-
[32]
Mmsegmentation: Open- mmlab semantic segmentation toolbox and benchmark,
MMSegmentation Contributors. Mmsegmentation: Open- mmlab semantic segmentation toolbox and benchmark,
-
[33]
Efficient and effective text encoding for chinese llama and alpaca
Yiming Cui, Ziqing Yang, and Xin Yao. Efficient and ef- fective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177, 2023. 2, 3, 4, 5, 6, 11, 12
-
[34]
Deformable convolu- tional networks
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolu- tional networks. In ICCV, pages 764–773, 2017. 3
work page 2017
-
[35]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, AnthonyMeng Huat, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 1, 8, 11
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35:16344–16359,
-
[37]
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos ´e MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, pages 326–335, 2017. 5, 18
work page 2017
-
[38]
Scaling vision transformers to 22 billion pa- rameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdul- mohsin, et al. Scaling vision transformers to 22 billion pa- rameters. In ICML, pages 7480–7512, 2023. 3, 4, 6, 7, 12
work page 2023
-
[39]
Imagenet: A large-scale hierarchical im- age database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In CVPR, pages 248–255, 2009. 2, 3, 6, 7, 9, 10, 11, 12, 13, 15
work page 2009
-
[40]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[41]
Repvgg: Making vgg- style convnets great again
Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg- style convnets great again. In CVPR, pages 13733–13742,
-
[42]
Dreamllm: Synergistic multimodal com- prehension and creation
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal com- prehension and creation. arXiv preprint arXiv:2309.11499,
-
[43]
An image is worth 16x16 words: Trans- formers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 3, 4
work page 2020
-
[44]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Glm: General language model pretraining with autoregressive blank infilling
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In ACL, pages 320–335, 2022. 3, 11
work page 2022
-
[46]
The pascal visual object classes challenge: A retrospective
Mark Everingham, SM Ali Eslami, Luc Van Gool, Christo- pher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111:98–136, 2015. 11, 16
work page 2015
-
[47]
Eva: Exploring the limits of masked visual represen- tation learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual represen- tation learning at scale. arXiv preprint arXiv:2211.07636,
-
[48]
Eva-02: A visual representation for neon genesis
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual represen- tation for neon genesis. arXiv preprint arXiv:2303.11331,
-
[49]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(1):5232–5270,
-
[50]
Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. In CVPRW, pages 178–178, 2004. 11, 15
work page 2004
-
[51]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive eval- uation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 8, 18
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
arXiv preprint arXiv:2304.15010 , year=
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xi- angyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,
-
[53]
Challenges in representation learning: A report on three machine learning contests
Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukier- ski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Challenges in representation learning: A report on three machine learning contests. In ICONIP, pages 117–124,
- [54]
-
[55]
Making the v in vqa matter: El- evating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: El- evating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017. 5, 8, 17, 18
work page 2017
-
[56]
Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark
Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei 20 Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. NeurIPS, 35: 26418–26431, 2022. 5, 7, 10, 13, 15
work page 2022
-
[57]
Vizwiz grand challenge: Answering visual questions from blind people
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, pages 3608–3617, 2018. 8, 18
work page 2018
-
[58]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 1, 3, 15
work page 2016
-
[59]
Masked autoencoders are scal- able vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scal- able vision learners. In CVPR, pages 16000–16009, 2022. 6, 12
work page 2022
-
[60]
Eurosat: A novel dataset and deep learn- ing benchmark for land use and land cover classification
Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learn- ing benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Obser- vations and Remote Sensing , 12(7):2217–2226, 2019. 11, 16
work page 2019
-
[61]
The many faces of ro- bustness: A critical analysis of out-of-distribution general- ization
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of ro- bustness: A critical analysis of out-of-distribution general- ization. In ICCV, pages 8340–8349, 2021. 6, 7, 15
work page 2021
-
[62]
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Stein- hardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021. 6, 7, 15
work page 2021
-
[63]
Squeeze-and-excitation networks
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018. 3
work page 2018
-
[64]
Deep networks with stochastic depth
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, pages 646–661, 2016. 12, 13
work page 2016
-
[65]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 5, 8, 17, 18
work page 2019
-
[66]
Densenet: Implementing efficient convnet descriptor pyramids
Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014. 3
-
[67]
Introducing idefics: An open reproduction of state-of-the-art visual language model
IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023. 8
work page 2023
-
[68]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. Zenodo. Version 0.1. https://doi.org/10.5281/zenodo.5143773, 2021. DOI: 10.5281/zenodo.5143773. 3, 6, 7, 8, 10, 11
-
[69]
Batch normalization: Accelerating deep network training by reducing internal covariate shift
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015. 12
work page 2015
-
[70]
Mural: multimodal, multitask retrieval across languages
Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. Mural: multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125, 2021. 10
-
[71]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021. 2, 3, 10
work page 2021
-
[72]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016. 5, 17
work page 2016
-
[73]
3d object representations for fine-grained categorization
Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCVW, pages 554–561, 2013. 11, 15
work page 2013
-
[74]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, 25, 2012. 3
work page 2012
-
[75]
Learning multiple layers of features from tiny images
Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009. 11, 15, 16
work page 2009
-
[76]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023. 3
-
[77]
Clip benchmark: Clip-like model evaluation
LAION-AI. Clip benchmark: Clip-like model evaluation. https://github.com/LAION-AI/CLIP_benchmark, 2023. 7, 15
work page 2023
-
[78]
Fluency-guided cross-lingual image captioning
Weiyu Lan, Xirong Li, and Jianfeng Dong. Fluency-guided cross-lingual image captioning. In ACM MM, pages 1549–1557, 2017. 7, 8, 10, 12, 16
work page 2017
-
[79]
Gradient-based learning applied to document recognition
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,
-
[80]
Otter: A multi-modal model with in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 3, 11