pith. machine review for the scientific record.

arxiv: 2312.14238 · v3 · submitted 2023-12-21 · 💻 cs.CV

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Bin Li, Guo Chen, Jiannan Wu, Jifeng Dai, Lewei Lu, Muyan Zhong, Ping Luo, Qinglong Zhang, Sen Xing, Tong Lu, Weijie Su, Wenhai Wang, Xizhou Zhu, Yu Qiao, Zhe Chen

Pith reviewed 2026-05-13 22:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · foundation models · multimodal alignment · zero-shot learning · visual perception · large language models · image-text data · multi-modal dialogue

The pith

InternVL scales a vision foundation model to 6 billion parameters and progressively aligns it with an LLM on web-scale image-text data to reach state-of-the-art performance on 32 visual-linguistic benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternVL as a vision-language foundation model that enlarges the vision component to 6 billion parameters. It then aligns this vision encoder step by step with a large language model using large collections of image-text pairs gathered from the web. The resulting model handles both coarse image recognition and fine pixel-level tasks, delivers strong zero-shot results on image and video classification and retrieval, and combines with language models for multi-turn dialogue. A reader would care because the work shows one concrete way to bring vision scaling closer to the rapid progress already seen in language models. The authors position the model as a direct, smaller alternative to the much larger ViT-22B.

Core claim

InternVL scales the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources, allowing broad application and state-of-the-art performance on 32 generic visual-linguistic benchmarks that include image-level and pixel-level recognition, zero-shot image and video classification, zero-shot image and video-text retrieval, and multi-modal dialogue systems.

What carries the argument

The InternVL architecture, which enlarges the vision encoder to 6 billion parameters and applies progressive alignment with an LLM on diverse web-scale image-text pairs to unify perception and language capabilities.

If this is right

  • The model supports both image-level and pixel-level visual recognition at competitive accuracy.
  • It enables zero-shot classification and retrieval for both still images and video.
  • Linking InternVL with an LLM produces capable multi-modal dialogue systems.
  • The 6B vision component serves as a practical substitute for the larger ViT-22B.
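
To make the zero-shot bullet concrete, here is a minimal sketch of how a contrastively aligned image-text model is commonly used for zero-shot classification: embed the image, embed one text prompt per class, and pick the class whose prompt embedding has the highest cosine similarity to the image embedding. This is the generic recipe, not a procedure confirmed by the summary above. The embeddings are hand-written toy vectors, and `cosine` and `zero_shot_classify` are hypothetical helper names; nothing here is taken from the InternVL codebase.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the best-matching prompt and the similarity score of every prompt."""
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (assumed values standing in for real encoder outputs).
image_emb = [0.9, 0.1, 0.2]
class_text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

label, scores = zero_shot_classify(image_emb, class_text_embs)
print(label)
```

Zero-shot retrieval can reuse the same similarity scores, ranked over a gallery of images or captions instead of over class prompts.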

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same scaling-plus-alignment recipe could be tested on other modalities such as audio or 3D scenes to check whether the gains transfer.
  • Wider deployment may reveal whether performance holds when the input distribution shifts away from the original web data sources.
  • Integration into existing LLM tool-use pipelines could let single models handle mixed visual and textual instructions with fewer separate components.

Load-bearing premise

That scaling the vision encoder to 6 billion parameters and aligning it progressively on web-scale image-text data will produce generalizable state-of-the-art results across 32 diverse benchmarks without overfitting or source-specific biases.

What would settle it

Finding a new visual-linguistic benchmark, drawn from data sources outside the training mixture, on which InternVL falls below the performance of prior smaller models.

Original abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InternVL, a vision-language foundation model that scales the vision encoder to 6 billion parameters and progressively aligns it with an LLM using web-scale image-text data from various sources. It claims state-of-the-art performance on 32 generic visual-linguistic benchmarks, including image/pixel-level recognition, zero-shot image/video classification and retrieval, and multimodal dialogue systems, while positioning the model as a viable alternative to ViT-22B.

Significance. If the performance claims hold after addressing data-contamination risks, the work would represent a meaningful step in scaling vision foundation models to 6B parameters and aligning them for broad multimodal tasks. The public release of code and models is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [§3.2 and §4] The progressive alignment procedure on heterogeneous web-scale corpora is described, but no overlap statistics, decontamination steps, or membership inference results are reported between the collected training pairs and the 32 evaluation benchmarks (ImageNet, COCO, VQAv2, ActivityNet, etc.). Without this, the zero-shot and few-shot SOTA numbers cannot be reliably interpreted as generalization rather than leakage.
  2. [§5, Experiments] The manuscript asserts SOTA on 32 benchmarks after scaling and alignment, yet provides neither ablation tables isolating the contribution of the 6B vision encoder versus the alignment stages, nor error bars, nor full baseline comparisons with contemporaneous models. These omissions make the central scaling claim difficult to verify from the reported results.
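
One simple form the membership-inference check requested in major comment 1 could take is a loss-threshold attack: predict that a sample was seen in training whenever the model's loss on it falls below a threshold, then measure how well that prediction separates known training samples from known unseen ones. The sketch below is illustrative only. The choice of attack, the threshold, the function names, and the loss values are all assumptions, not numbers or code from the paper.

```python
def loss_threshold_attack(losses, threshold):
    """Predict 'member' (True) for every example whose loss is below the threshold."""
    return [loss < threshold for loss in losses]


def attack_balanced_accuracy(member_losses, nonmember_losses, threshold):
    """Average of the true-positive rate on members and true-negative rate on non-members."""
    member_preds = loss_threshold_attack(member_losses, threshold)
    nonmember_preds = loss_threshold_attack(nonmember_losses, threshold)
    tpr = sum(member_preds) / len(member_preds)
    tnr = sum(not p for p in nonmember_preds) / len(nonmember_preds)
    return (tpr + tnr) / 2


# Placeholder per-sample losses (assumed values, not measurements).
member_losses = [0.2, 0.3, 0.9, 0.4]       # samples known to be in training
nonmember_losses = [1.1, 0.8, 1.3, 0.35]   # samples known to be held out

acc = attack_balanced_accuracy(member_losses, nonmember_losses, threshold=0.5)
print(acc)
```

A balanced accuracy near 0.5 means the attack cannot tell the two groups apart; a value well above 0.5 when benchmark samples are scored this way would be a warning sign for leakage.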
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise table summarizing the 32 benchmarks and the precise metrics on which SOTA is claimed.
  2. [§3] Notation for the vision encoder size (6B) and the LLM component should be introduced earlier and used consistently throughout the method sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate additional analyses where feasible.

Point-by-point responses
  1. Referee: [§3.2 and §4] The progressive alignment procedure on heterogeneous web-scale corpora is described, but no overlap statistics, decontamination steps, or membership inference results are reported between the collected training pairs and the 32 evaluation benchmarks (ImageNet, COCO, VQAv2, ActivityNet, etc.). Without this, the zero-shot and few-shot SOTA numbers cannot be reliably interpreted as generalization rather than leakage.

    Authors: We agree that explicit decontamination reporting is necessary to substantiate the zero-shot claims. In the revised version we will add a dedicated subsection under §3.2 that reports (i) n-gram overlap statistics between the web-scale training pairs and each of the 32 evaluation benchmarks, (ii) the exact decontamination filters applied (e.g., exact URL and caption matching), and (iii) membership-inference results obtained via a simple loss-threshold attack on a held-out subset of the benchmarks. These additions will allow readers to assess the degree of leakage directly. revision: yes
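
    The n-gram overlap statistic in item (i) could be computed along the following lines. This is a minimal sketch under stated assumptions: whitespace tokenization, word trigrams, and captions as the only compared field are choices made here for illustration, and `ngrams` and `overlap_rate` are hypothetical names. The rebuttal does not specify any of these details.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in a caption (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(train_captions, benchmark_captions, n=3):
    """Fraction of benchmark captions sharing at least one n-gram with the training captions."""
    train_grams = set()
    for caption in train_captions:
        train_grams |= ngrams(caption, n)
    flagged = [c for c in benchmark_captions if ngrams(c, n) & train_grams]
    rate = len(flagged) / len(benchmark_captions) if benchmark_captions else 0.0
    return rate, flagged


# Toy captions (assumed examples, not drawn from any real dataset).
train = ["a brown dog runs on the beach", "two people riding bikes downtown"]
bench = [
    "a brown dog runs across a field",
    "a plate of pasta on a table",
    "people riding bikes in the rain",
]

rate, flagged = overlap_rate(train, bench, n=3)
print(f"{rate:.2f} of benchmark captions share a trigram with training data")
```

    Flagged captions are only candidates for contamination; short generic phrases can overlap by chance, so a real audit would likely pair this with the exact-match filters from item (ii).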

  2. Referee: [§5, Experiments] The manuscript asserts SOTA on 32 benchmarks after scaling and alignment, yet provides neither ablation tables isolating the contribution of the 6B vision encoder versus the alignment stages, nor error bars, nor full baseline comparisons with contemporaneous models. These omissions make the central scaling claim difficult to verify from the reported results.

    Authors: We acknowledge that the current experimental section would benefit from more granular ablations. We will insert a new ablation table in §5 that isolates (a) the effect of scaling the vision encoder from 1B to 6B parameters while keeping the alignment procedure fixed, and (b) the incremental gains from each stage of the progressive alignment. Where multiple random seeds were run, we will report mean ± standard deviation. We will also expand the baseline table to include additional contemporaneous models (e.g., recent 2023–2024 vision-language models) that were omitted for space reasons in the original submission. revision: yes
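
    The "mean ± standard deviation" reporting promised above can be sketched as follows. The configuration names and per-seed scores are placeholders invented for illustration, not results from the paper, and `summarize` is a hypothetical helper. The sample standard deviation (n − 1 denominator) is an assumed convention; the rebuttal does not say which one the authors would use.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of per-seed scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Placeholder per-seed scores (assumed values, NOT reported results).
ablation = {
    "config_a": [70.0, 71.0, 72.0],
    "config_b": [75.0, 76.0, 77.0],
}

for name, scores in ablation.items():
    mean, std = summarize(scores)
    print(f"{name}: {mean:.1f} ± {std:.1f} (n={len(scores)})")
```

    Reporting the number of seeds next to each entry lets readers judge how much weight the error bar can carry.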

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an empirical pipeline: scaling a vision encoder to 6B parameters, followed by progressive alignment stages on web-scale image-text corpora, then direct evaluation on 32 external benchmarks (ImageNet, COCO, VQAv2, etc.). No load-bearing step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop; architectural choices and training objectives are stated explicitly and the reported SOTA numbers rest on independent test sets rather than internal re-derivation of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard transformer architectures and scaling practices from prior literature without introducing new free parameters, invented entities, or ad-hoc axioms beyond the domain assumption that web-scale data supports broad generalization.

axioms (2)
  • domain assumption: Transformer-based vision and language models follow established scaling laws when trained on web-scale image-text pairs
    Invoked implicitly when claiming that scaling to 6B parameters and alignment will yield SOTA results across diverse tasks.
  • domain assumption: Standard benchmark suites are sufficient to demonstrate general visual-linguistic capabilities
    The claim of broad applicability rests on performance across 32 existing benchmarks without new evaluation protocols.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  2. CATS: Curvature Aware Temporal Selection for efficient long video understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.

  3. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  4. Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.

  5. Anisotropic Modality Align

    cs.MM 2026-05 unverdicted novelty 6.0

    Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

  6. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  7. If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

    cs.CV 2026-04 unverdicted novelty 6.0

    LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

  8. CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

  9. AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

  10. Long Context Transfer from Language to Vision

    cs.CV 2024-06 unverdicted novelty 6.0

    Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

  11. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  12. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

  13. From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

  14. Are Face Embeddings Compatible Across Deep Neural Network Models?

    cs.CV 2026-04 unverdicted novelty 5.0

    Simple affine transformations align face embeddings across different DNN models, substantially improving cross-model identification and verification performance.

  15. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  16. DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    cs.CV 2024-12 accept novelty 5.0

    DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

  17. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  18. When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

    cs.CV 2026-05 unverdicted novelty 4.0

    Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.

  19. When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

    cs.CV 2026-05 unverdicted novelty 4.0

    Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.

  20. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  21. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  22. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  23. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

189 extracted references · 189 canonical work pages · cited by 21 Pith papers · 28 internal anchors

  1. [1]

    Towards zero- shot cross-lingual image retrieval

    Pranav Aggarwal and Ajinkya Kale. Towards zero- shot cross-lingual image retrieval. arXiv preprint arXiv:2012.05107, 2020. 8, 10, 16

  2. [2]

    Nocaps: Novel object cap- tioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object cap- tioning at scale. In ICCV, pages 8948–8957, 2019. 8, 17, 18

  3. [3]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022. 1, 3, 8

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Day- iheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...

  5. [5]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

  6. [6]

    Baichuan 2: Open large-scale language models

    Baichuan. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023. 3

  7. [7]

    Beit: Bert pre- training of image transformers

    Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre- training of image transformers. In ICLR, 2022. 6, 11, 12

  8. [8]

    Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

    Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 32, 2019. 7, 15

  9. [9]

    Bird- snap: Large-scale fine-grained visual categorization of birds

    Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Bird- snap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2011–2018, 2014. 11, 16

  10. [10]

    Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020

    Lucas Beyer, Olivier J H ´enaff, Alexander Kolesnikov, Xi- aohua Zhai, and A ¨aron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020. 6, 15

  11. [11]

    Contrastive language-image pre-training for the italian language

    Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Sil- via Terragni, Gabriele Sarti, and Sri Lakshmi. Contrastive language-image pre-training for the italian language. arXiv preprint arXiv:2108.08688, 2021. 7

  12. [12]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019. 5, 17

  13. [13]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461, 2014. 11, 16

  14. [14]

    Coyo-700m: Image-text pair dataset, 2022

    Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset, 2022. 5, 13, 15

  15. [15]

    Reversible column networks

    Yuxuan Cai, Yizhuang Zhou, Qi Han, Jianjian Sun, Xiang- wen Kong, Jun Li, and Xiangyu Zhang. Reversible column networks. arXiv preprint arXiv:2212.11696, 2022. 3

  16. [16]

    Cross-lingual and multilingual clip

    Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Mag- nus Sahlgren. Cross-lingual and multilingual clip. In Pro- ceedings of the Thirteenth Language Resources and Evalu- ation Conference, pages 6848–6854, 2022. 7, 8, 10

  17. [17]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017. 7, 8, 13, 16

  18. [18]

    A short note about kinetics-

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-

  19. [19]

    7, 13, 16

    arXiv preprint arXiv:1808.01340, 2018. 7, 13, 16

  20. [20]

    A short note on the kinetics-700 human action dataset

    Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zis- serman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 7, 8, 13, 16

  21. [21]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021. 5, 13, 15

  22. [22]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 1, 3, 8

  23. [23]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server. arXiv preprint arXiv:1504.00325, 2015. 5, 7, 8, 16, 17, 18

  24. [24]

    Pali: A jointly-scaled multilingual language-image model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. In ICLR, 2022. 1, 3, 4

  25. [25]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 8

  26. [26]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2022. 3

  27. [27]

    Altclip: Altering the lan- guage encoder in clip for extended language capabilities

    Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Altclip: Altering the lan- guage encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022. 7, 8, 10

  28. [28]

    Remote sens- ing image scene classification: Benchmark and state of the art

    Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sens- ing image scene classification: Benchmark and state of the art. Proceedings of the IEEE , 105(10):1865–1883, 2017. 11 19

  29. [29]

    Describing tex- tures in the wild

    Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing tex- tures in the wild. In CVPR, pages 3606–3613, 2014. 11, 16

  30. [30]

    Simple and effective multi-paragraph reading comprehension

    Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723, 2017. 5, 17

  31. [31]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTAT, pages 215–223, 2011. 11, 16

  32. [32]

    Mmsegmentation: Open- mmlab semantic segmentation toolbox and benchmark,

    MMSegmentation Contributors. Mmsegmentation: Open- mmlab semantic segmentation toolbox and benchmark,

  33. [33]

    Efficient and effective text encoding for chinese llama and alpaca

    Yiming Cui, Ziqing Yang, and Xin Yao. Efficient and ef- fective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177, 2023. 2, 3, 4, 5, 6, 11, 12

  34. [34]

    Deformable convolu- tional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolu- tional networks. In ICCV, pages 764–773, 2017. 3

  35. [35]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, AnthonyMeng Huat, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 1, 8, 11

  36. [36]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35:16344–16359,

  37. [37]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos ´e MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, pages 326–335, 2017. 5, 18

  38. [38]

    Scaling vision transformers to 22 billion pa- rameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdul- mohsin, et al. Scaling vision transformers to 22 billion pa- rameters. In ICML, pages 7480–7512, 2023. 3, 4, 6, 7, 12

  39. [39]

    Imagenet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In CVPR, pages 248–255, 2009. 2, 3, 6, 7, 9, 10, 11, 12, 13, 15

  40. [40]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2, 3

  41. [41]

    Repvgg: Making vgg- style convnets great again

    Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg- style convnets great again. In CVPR, pages 13733–13742,

  42. [42]

    Dreamllm: Synergistic multimodal com- prehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal com- prehension and creation. arXiv preprint arXiv:2309.11499,

  43. [43]

    An image is worth 16x16 words: Trans- formers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 3, 4

  44. [44]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 3

  45. [45]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In ACL, pages 320–335, 2022. 3, 11

  46. [46]

    The pascal visual object classes challenge: A retrospective

    Mark Everingham, SM Ali Eslami, Luc Van Gool, Christo- pher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111:98–136, 2015. 11, 16

  47. [47]

    Eva: Exploring the limits of masked visual represen- tation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual represen- tation learning at scale. arXiv preprint arXiv:2211.07636,

  48. [48]

    Eva-02: A visual representation for neon genesis

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual represen- tation for neon genesis. arXiv preprint arXiv:2303.11331,

  49. [49]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(1):5232–5270,

  50. [50]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

    Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPRW, pages 178–178, 2004. 11, 15

  51. [51]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 8, 18

  52. [52]

    Llama-adapter v2: Parameter-efficient visual instruction model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,

  53. [53]

    Challenges in representation learning: A report on three machine learning contests

    Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Challenges in representation learning: A report on three machine learning contests. In ICONIP, pages 117–124,

  54. [54]

    Google bard

    Google. Google bard. https://bard.google.com/,

  55. [55]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017. 5, 8, 17, 18

  56. [56]

    Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark

    Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. NeurIPS, 35:26418–26431, 2022. 5, 7, 10, 13, 15

  57. [57]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, pages 3608–3617, 2018. 8, 18

  58. [58]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 1, 3, 15

  59. [59]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022. 6, 12

  60. [60]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 11, 16

  61. [61]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340–8349, 2021. 6, 7, 15

  62. [62]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021. 6, 7, 15

  63. [63]

    Squeeze-and-excitation networks

    Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018. 3

  64. [64]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, pages 646–661, 2016. 12, 13

  65. [65]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 5, 8, 17, 18

  66. [66]

    Densenet: Implementing efficient convnet descriptor pyramids

    Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014. 3

  67. [67]

    Introducing idefics: An open reproduction of state-of-the-art visual language model

    IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023. 8

  68. [68]

    Openclip

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. Zenodo. Version 0.1. https://doi.org/10.5281/zenodo.5143773, 2021. DOI: 10.5281/zenodo.5143773. 3, 6, 7, 8, 10, 11

  69. [69]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015. 12

  70. [70]

    Mural: multimodal, multitask retrieval across languages

    Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. Mural: multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125, 2021. 10

  71. [71]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021. 2, 3, 10

  72. [72]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016. 5, 17

  73. [73]

    3d object representations for fine-grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCVW, pages 554–561, 2013. 11, 15

  74. [74]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, 25, 2012. 3

  75. [75]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009. 11, 15, 16

  76. [76]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023. 3

  77. [77]

    Clip benchmark: Clip-like model evaluation

    LAION-AI. Clip benchmark: Clip-like model evaluation. https://github.com/LAION-AI/CLIP_benchmark, 2023. 7, 15

  78. [78]

    Fluency-guided cross-lingual image captioning

    Weiyu Lan, Xirong Li, and Jianfeng Dong. Fluency-guided cross-lingual image captioning. In ACM MM, pages 1549–1557, 2017. 7, 8, 10, 12, 16

  79. [79]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,

  80. [80]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 3, 11

Showing first 80 references.