arxiv: 2312.14238 · v3 · submitted 2023-12-21 · 💻 cs.CV

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Bin Li, Guo Chen, Jiannan Wu, Jifeng Dai, Lewei Lu, Muyan Zhong, Ping Luo, Qinglong Zhang, Sen Xing, Tong Lu, Weijie Su, Wenhai Wang, Xizhou Zhu, Yu Qiao, Zhe Chen

Pith reviewed 2026-05-13 22:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · foundation models · multimodal alignment · zero-shot learning · visual perception · large language models · image-text data · multi-modal dialogue

The pith

InternVL scales a vision foundation model to 6 billion parameters and progressively aligns it with an LLM on web-scale image-text data to reach state-of-the-art performance on 32 visual-linguistic benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternVL as a vision-language foundation model that enlarges the vision component to 6 billion parameters. It then aligns this vision encoder step by step with a large language model using large collections of image-text pairs gathered from the web. The resulting model handles both coarse image recognition and fine pixel-level tasks, delivers strong zero-shot results on image and video classification and retrieval, and combines with language models for multi-turn dialogue. A reader would care because the work shows one concrete way to bring vision scaling closer to the rapid progress already seen in language models. The authors position the model as a direct, smaller alternative to the much larger ViT-22B.

Core claim

InternVL scales the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources, allowing broad application and state-of-the-art performance on 32 generic visual-linguistic benchmarks that include image-level and pixel-level recognition, zero-shot image and video classification, zero-shot image and video-text retrieval, and multi-modal dialogue systems.

What carries the argument

The InternVL architecture, which enlarges the vision encoder to 6 billion parameters and applies progressive alignment with an LLM on diverse web-scale image-text pairs to unify perception and language capabilities.
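
As an illustration of what a single image-text alignment stage can look like, the sketch below computes a symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings. This is a generic CLIP-style example, not the paper's training code: this summary does not state the objective, so the loss choice, the function names, the temperature of 0.07, and the toy embeddings are all assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    Row i of image_embs is assumed to be the positive pair of row i of text_embs.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # The correct "class" for row i is index i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))

# Hypothetical toy batch: matched pairs versus deliberately swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image embedding sits closest to its own caption and large when pairs are mismatched, which is the property an alignment stage of this kind optimizes.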

If this is right

The model supports both image-level and pixel-level visual recognition at competitive accuracy.
It enables zero-shot classification and retrieval for both still images and video.
Linking InternVL with an LLM produces capable multi-modal dialogue systems.
The 6B vision component serves as a practical substitute for the larger ViT-22B.
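
Zero-shot classification with an aligned encoder pair is usually done by comparing an image embedding against text embeddings of class prompts. The sketch below shows that comparison with cosine similarity; the prompt strings, the three-dimensional toy vectors, and the function names are hypothetical stand-ins for real encoder outputs, not part of InternVL's released API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the prompt whose text embedding is most similar to the image embedding."""
    scores = {label: cosine(image_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings standing in for text-encoder outputs (hypothetical values).
class_text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
label, scores = zero_shot_classify([0.8, 0.2, 0.1], class_text_embs)
```

Zero-shot retrieval uses the same similarity scores but ranks a gallery of candidates instead of picking a single best label.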

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scaling-plus-alignment recipe could be tested on other modalities such as audio or 3D scenes to check whether the gains transfer.
Wider deployment may reveal whether performance holds when the input distribution shifts away from the original web data sources.
Integration into existing LLM tool-use pipelines could let single models handle mixed visual and textual instructions with fewer separate components.

Load-bearing premise

That scaling the vision encoder to 6 billion parameters and aligning it progressively on web-scale image-text data will produce generalizable state-of-the-art results across 32 diverse benchmarks without overfitting or source-specific biases.

What would settle it

Finding a new visual-linguistic benchmark, drawn from data sources outside the training mixture, on which InternVL falls below the performance of prior smaller models.

Original abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

InternVL ships a usable 6B vision backbone with broad benchmark numbers, but the zero-shot claims rest on unverified data overlap.

The main thing here is that the authors trained a 6B vision encoder, aligned it progressively to an LLM on mixed web-scale image-text pairs, and report competitive or leading numbers on 32 standard benchmarks covering perception, retrieval, classification, and dialogue. They also release the code and weights, which matters more than the abstract suggests. That combination gives the field a practical alternative to larger closed models like ViT-22B for downstream work. The progressive alignment stages look like a straightforward extension of existing vision-language recipes, but executing them at this scale and documenting the results across so many tasks is still useful empirical data. The public artifacts make it easy to test directly.

The soft spot is the training data. The stress-test note flags the lack of overlap statistics between their web corpora and the evaluation sets (ImageNet, COCO, VQAv2, etc.). If modest leakage exists, the zero-shot and few-shot gains become harder to interpret as genuine generalization. The paper does not appear to report decontamination steps or contamination audits, so that remains an open question rather than a settled strength. Methods sections are present in the full text but light on ablations for the alignment schedule and data mixing ratios.

This is the kind of work that belongs in a reading group for anyone building multimodal systems. It is worth a serious referee because the model release and scale are concrete, even if the generalization story needs tighter evidence on data hygiene. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces InternVL, a vision-language foundation model that scales the vision encoder to 6 billion parameters and progressively aligns it with an LLM using web-scale image-text data from various sources. It claims state-of-the-art performance on 32 generic visual-linguistic benchmarks, including image/pixel-level recognition, zero-shot image/video classification and retrieval, and multimodal dialogue systems, while positioning the model as a viable alternative to ViT-22B.

Significance. If the performance claims hold after addressing data-contamination risks, the work would represent a meaningful step in scaling vision foundation models to 6B parameters and aligning them for broad multimodal tasks. The public release of code and models is a clear strength that supports reproducibility and community follow-up.

major comments (2)

[§3.2 and §4] The progressive alignment procedure on heterogeneous web-scale corpora is described, but no overlap statistics, decontamination steps, or membership inference results are reported between the collected training pairs and the 32 evaluation benchmarks (ImageNet, COCO, VQAv2, ActivityNet, etc.). Without this, the zero-shot and few-shot SOTA numbers cannot be reliably interpreted as generalization rather than leakage.
[§5, Experiments] The manuscript asserts SOTA on 32 benchmarks after scaling and alignment, yet provides neither ablation tables isolating the contribution of the 6B vision encoder versus the alignment stages, nor error bars, nor full baseline comparisons with contemporaneous models. These omissions make the central scaling claim difficult to verify from the reported results.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a concise table summarizing the 32 benchmarks and the precise metrics on which SOTA is claimed.
[§3] Notation for the vision encoder size (6B) and the LLM component should be introduced earlier and used consistently throughout the method sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate additional analyses where feasible.

Referee: [§3.2 and §4] The progressive alignment procedure on heterogeneous web-scale corpora is described, but no overlap statistics, decontamination steps, or membership inference results are reported between the collected training pairs and the 32 evaluation benchmarks (ImageNet, COCO, VQAv2, ActivityNet, etc.). Without this, the zero-shot and few-shot SOTA numbers cannot be reliably interpreted as generalization rather than leakage.

Authors: We agree that explicit decontamination reporting is necessary to substantiate the zero-shot claims. In the revised version we will add a dedicated subsection under §3.2 that reports (i) n-gram overlap statistics between the web-scale training pairs and each of the 32 evaluation benchmarks, (ii) the exact decontamination filters applied (e.g., exact URL and caption matching), and (iii) membership-inference results obtained via a simple loss-threshold attack on a held-out subset of the benchmarks. These additions will allow readers to assess the degree of leakage directly.

revision: yes
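
The n-gram overlap statistic promised in item (i) could be computed along the following lines. This is a minimal sketch under stated assumptions: whitespace tokenization, word trigrams, and a "flag if any n-gram is shared" rule are illustrative choices, and the function names and sample captions are hypothetical rather than taken from the paper or its code.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a string (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_captions, benchmark_texts, n=3):
    """Fraction of benchmark texts that share at least one n-gram with any training caption."""
    if not benchmark_texts:
        return 0.0, []
    train_grams = set()
    for caption in train_captions:
        train_grams |= word_ngrams(caption, n)
    flagged = [t for t in benchmark_texts if word_ngrams(t, n) & train_grams]
    return len(flagged) / len(benchmark_texts), flagged

# Hypothetical captions, for illustration only.
train = ["A brown dog runs across the park", "Two people riding bikes at sunset"]
bench = ["a brown dog runs on the beach", "a red bus parked near a station"]
rate, flagged = contamination_rate(train, bench, n=3)
```

A text-side check like this only catches caption or question leakage; duplicate images with different captions would need a separate image-level comparison.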
Referee: [§5, Experiments] The manuscript asserts SOTA on 32 benchmarks after scaling and alignment, yet provides neither ablation tables isolating the contribution of the 6B vision encoder versus the alignment stages, nor error bars, nor full baseline comparisons with contemporaneous models. These omissions make the central scaling claim difficult to verify from the reported results.

Authors: We acknowledge that the current experimental section would benefit from more granular ablations. We will insert a new ablation table in §5 that isolates (a) the effect of scaling the vision encoder from 1B to 6B parameters while keeping the alignment procedure fixed, and (b) the incremental gains from each stage of the progressive alignment. Where multiple random seeds were run, we will report mean ± standard deviation. We will also expand the baseline table to include additional contemporaneous models (e.g., recent 2023–2024 vision-language models) that were omitted for space reasons in the original submission.

revision: yes
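
Reporting mean ± standard deviation across seeds, as the response proposes, takes only a few lines with Python's standard library. The sketch below uses the sample standard deviation; the accuracy values are hypothetical placeholders, not results from the paper, and the helper name is an assumption.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        # A single run has no spread to report.
        return mean(scores), float("nan")
    return mean(scores), stdev(scores)

# Hypothetical accuracies from three random seeds.
runs = [82.1, 82.5, 81.9]
m, s = summarize_runs(runs)
report = f"{m:.2f} ± {s:.2f}"
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so listing the number of runs next to each figure helps readers judge it.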

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an empirical pipeline: scaling a vision encoder to 6B parameters, followed by progressive alignment stages on web-scale image-text corpora, then direct evaluation on 32 external benchmarks (ImageNet, COCO, VQAv2, etc.). No load-bearing step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop; architectural choices and training objectives are stated explicitly and the reported SOTA numbers rest on independent test sets rather than internal re-derivation of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard transformer architectures and scaling practices from prior literature without introducing new free parameters, invented entities, or ad-hoc axioms beyond the domain assumption that web-scale data supports broad generalization.

axioms (2)

domain assumption: Transformer-based vision and language models follow established scaling laws when trained on web-scale image-text pairs.
Invoked implicitly when claiming that scaling to 6B parameters and alignment will yield SOTA results across diverse tasks.

domain assumption: Standard benchmark suites are sufficient to demonstrate general visual-linguistic capabilities.
The claim of broad applicability rests on performance across 32 existing benchmarks without new evaluation protocols.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
cs.CV · 2026-04 · unverdicted · novelty 8.0
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

CATS: Curvature Aware Temporal Selection for efficient long video understanding
cs.CV · 2026-05 · unverdicted · novelty 7.0
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV · 2026-04 · unverdicted · novelty 7.0
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
cs.CV · 2026-04 · unverdicted · novelty 7.0
Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.

Anisotropic Modality Align
cs.MM · 2026-05 · unverdicted · novelty 6.0
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

Agentic AI for Remote Sensing: Technical Challenges and Research Directions
cs.CV · 2026-04 · unverdicted · novelty 6.0
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
cs.CV · 2026-04 · unverdicted · novelty 6.0
LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC · 2026-04 · unverdicted · novelty 6.0
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis
cs.CV · 2026-04 · unverdicted · novelty 6.0
AICA-Bench evaluates 23 VLMs on affective image analysis, identifies weak intensity calibration and shallow descriptions as limitations, and proposes training-free Grounded Affective Tree Prompting to improve performance.

Long Context Transfer from Language to Vision
cs.CV · 2024-06 · unverdicted · novelty 6.0
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.

Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV · 2024-03 · conditional · novelty 6.0
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

Agentic AI for Remote Sensing: Technical Challenges and Research Directions
cs.CV · 2026-04 · unverdicted · novelty 5.0
Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
cs.CV · 2026-04 · unverdicted · novelty 5.0
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

Are Face Embeddings Compatible Across Deep Neural Network Models?
cs.CV · 2026-04 · unverdicted · novelty 5.0
Simple affine transformations align face embeddings across different DNN models, substantially improving cross-model identification and verification performance.

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
cs.CV · 2026-04 · unverdicted · novelty 5.0
Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV · 2024-12 · accept · novelty 5.0
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...

LLaVA-OneVision: Easy Visual Task Transfer
cs.CV · 2024-08 · unverdicted · novelty 5.0
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV · 2026-05 · unverdicted · novelty 4.0
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt augmentation and preprocessing offering only partial mitigation.

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV · 2026-05 · unverdicted · novelty 4.0
Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV · 2025-01 · unverdicted · novelty 4.0
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

PaliGemma: A versatile 3B VLM for transfer
cs.CV · 2024-07 · unverdicted · novelty 4.0
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV · 2024-04 · unverdicted · novelty 4.0
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

A Survey on Hallucination in Large Vision-Language Models
cs.CV · 2024-02 · unverdicted · novelty 3.0
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

189 extracted references · 189 canonical work pages · cited by 21 Pith papers · 28 internal anchors

[1]

Towards zero- shot cross-lingual image retrieval

Pranav Aggarwal and Ajinkya Kale. Towards zero- shot cross-lingual image retrieval. arXiv preprint arXiv:2012.05107, 2020. 8, 10, 16

work page arXiv 2012
[2]

Nocaps: Novel object cap- tioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object cap- tioning at scale. In ICCV, pages 8948–8957, 2019. 8, 17, 18

work page 2019
[3]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022. 1, 3, 8

work page 2022
[4]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Day- iheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 ,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Baichuan 2: Open large-scale language models

Baichuan. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023. 3

work page arXiv 2023
[7]

Beit: Bert pre- training of image transformers

Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre- training of image transformers. In ICLR, 2022. 6, 11, 12

work page 2022
[8]

Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 32, 2019. 7, 15

work page 2019
[9]

Bird- snap: Large-scale fine-grained visual categorization of birds

Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Bird- snap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2011–2018, 2014. 11, 16

work page 2011
[10]

Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020

Lucas Beyer, Olivier J H ´enaff, Alexander Kolesnikov, Xi- aohua Zhai, and A ¨aron van den Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020. 6, 15

work page arXiv 2006
[11]

Contrastive language-image pre-training for the italian language

Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Sil- via Terragni, Gabriele Sarti, and Sri Lakshmi. Contrastive language-image pre-training for the italian language. arXiv preprint arXiv:2108.08688, 2021. 7

work page arXiv 2021
[12]

Scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, pages 4291–4301, 2019. 5, 17

work page 2019
[13]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461, 2014. 11, 16

work page 2014
[14]

Coyo-700m: Image-text pair dataset, 2022

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset, 2022. 5, 13, 15

work page 2022
[15]

Reversible column networks

Yuxuan Cai, Yizhuang Zhou, Qi Han, Jianjian Sun, Xiang- wen Kong, Jun Li, and Xiangyu Zhang. Reversible column networks. arXiv preprint arXiv:2212.11696, 2022. 3

work page arXiv 2022
[16]

Cross-lingual and multilingual clip

Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Mag- nus Sahlgren. Cross-lingual and multilingual clip. In Pro- ceedings of the Thirteenth Language Resources and Evalu- ation Conference, pages 6848–6854, 2022. 7, 8, 10

work page 2022
[17]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 6299–6308, 2017. 7, 8, 13, 16

work page 2017
[18]

A short note about kinetics-

Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-

work page
[19]

7, 13, 16

arXiv preprint arXiv:1808.01340, 2018. 7, 13, 16

work page arXiv 2018
[20]

A short note on the kinetics-700 human action dataset

Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zis- serman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019. 7, 8, 13, 16

work page arXiv 1907
[21]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021. 5, 13, 15

work page 2021
[22]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 1, 3, 8

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Doll ´ar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server. arXiv preprint arXiv:1504.00325, 2015. 5, 7, 8, 16, 17, 18

work page internal anchor Pith review Pith/arXiv arXiv 2015
[24]

Pali: A jointly-scaled multilingual language-image model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. In ICLR, 2022. 1, 3, 4

work page 2022
[25]

Pali-x: On scaling up a multilingual vision and language model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 8

work page arXiv 2023
[26]

Vision transformer adapter for dense predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In ICLR, 2022. 3

work page 2022
[27]

Altclip: Altering the lan- guage encoder in clip for extended language capabilities

Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Altclip: Altering the lan- guage encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022. 7, 8, 10

work page arXiv 2022
[28]

Remote sens- ing image scene classification: Benchmark and state of the art

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sens- ing image scene classification: Benchmark and state of the art. Proceedings of the IEEE , 105(10):1865–1883, 2017. 11 19

work page 2017
[29]

Describing tex- tures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing tex- tures in the wild. In CVPR, pages 3606–3613, 2014. 11, 16

work page 2014
[30]

Simple and effective multi-paragraph reading comprehension

Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723, 2017. 5, 17

work page arXiv 2017
[31]

An analysis of single-layer networks in unsupervised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTAT, pages 215–223, 2011. 11, 16

work page 2011
[32]

Mmsegmentation: Open- mmlab semantic segmentation toolbox and benchmark,

MMSegmentation Contributors. Mmsegmentation: Open- mmlab semantic segmentation toolbox and benchmark,

work page
[33]

Efficient and effective text encoding for chinese llama and alpaca

Yiming Cui, Ziqing Yang, and Xin Yao. Efficient and ef- fective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177, 2023. 2, 3, 4, 5, 6, 11, 12

work page arXiv 2023
[34]

Deformable convolu- tional networks

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolu- tional networks. In ICCV, pages 764–773, 2017. 3

work page 2017
[35]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, AnthonyMeng Huat, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 1, 8, 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christo- pher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS, 35:16344–16359,

work page
[37]

Visual dialog

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos ´e MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, pages 326–335, 2017. 5, 18

work page 2017
[38]

Scaling vision transformers to 22 billion pa- rameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdul- mohsin, et al. Scaling vision transformers to 22 billion pa- rameters. In ICML, pages 7480–7512, 2023. 3, 4, 6, 7, 12

work page 2023
[39]

Imagenet: A large-scale hierarchical im- age database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In CVPR, pages 248–255, 2009. 2, 3, 6, 7, 9, 10, 11, 12, 13, 15

work page 2009
[40]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2018
[41]

Repvgg: Making vgg- style convnets great again

Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg- style convnets great again. In CVPR, pages 13733–13742,

work page
[42]

Dreamllm: Synergistic multimodal com- prehension and creation

Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal com- prehension and creation. arXiv preprint arXiv:2309.11499,

work page arXiv
[43]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 3, 4

work page 2020
[44]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Glm: General language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In ACL, pages 320–335, 2022. 3, 11

work page 2022
[46]

The pascal visual object classes challenge: A retrospective

Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111:98–136, 2015. 11, 16

work page 2015
[47]

Eva: Exploring the limits of masked visual representation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636,

work page arXiv
[48]

Eva-02: A visual representation for neon genesis

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331,

work page arXiv
[49]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(1):5232–5270,

work page
[50]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPRW, pages 178–178, 2004. 11, 15

work page 2004
[51]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 8, 18

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Llama-adapter v2: Parameter-efficient visual instruction model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,

work page arXiv
[53]

Challenges in representation learning: A report on three machine learning contests

Ian J Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, et al. Challenges in representation learning: A report on three machine learning contests. In ICONIP, pages 117–124,

work page
[54]

Google bard

Google. Google bard. https://bard.google.com/,

work page
[55]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, pages 6904–6913, 2017. 5, 8, 17, 18

work page 2017
[56]

Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. NeurIPS, 35:26418–26431, 2022. 5, 7, 10, 13, 15

work page 2022
[57]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, pages 3608–3617, 2018. 8, 18

work page 2018
[58]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. 1, 3, 15

work page 2016
[59]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022. 6, 12

work page 2022
[60]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 11, 16

work page 2019
[61]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340–8349, 2021. 6, 7, 15

work page 2021
[62]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021. 6, 7, 15

work page 2021
[63]

Squeeze-and-excitation networks

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018. 3

work page 2018
[64]

Deep networks with stochastic depth

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, pages 646–661, 2016. 12, 13

work page 2016
[65]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019. 5, 8, 17, 18

work page 2019
[66]

Densenet: Implementing efficient convnet descriptor pyramids

Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869, 2014. 3

work page arXiv 2014
[67]

Introducing idefics: An open reproduction of state-of-the-art visual language model

IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023. 8

work page 2023
[68]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. Zenodo. Version 0.1. https://doi.org/10.5281/zenodo.5143773, 2021. DOI: 10.5281/zenodo.5143773. 3, 6, 7, 8, 10, 11

work page doi:10.5281/zenodo.5143773 2021
[69]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015. 12

work page 2015
[70]

Mural: multimodal, multitask retrieval across languages

Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, and Jason Baldridge. Mural: multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125, 2021. 10

work page arXiv 2021
[71]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021. 2, 3, 10

work page 2021
[72]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, pages 235–251, 2016. 5, 17

work page 2016
[73]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCVW, pages 554–561, 2013. 11, 15

work page 2013
[74]

Imagenet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. NeurIPS, 25, 2012. 3

work page 2012
[75]

Learning multiple layers of features from tiny images

Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009. 11, 15, 16

work page 2009
[76]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023. 3

work page arXiv 2023
[77]

Clip benchmark: Clip-like model evaluation

LAION-AI. Clip benchmark: Clip-like model evaluation. https://github.com/LAION-AI/CLIP_benchmark, 2023. 7, 15

work page 2023
[78]

Fluency-guided cross-lingual image captioning

Weiyu Lan, Xirong Li, and Jianfeng Dong. Fluency-guided cross-lingual image captioning. In ACM MM, pages 1549–1557, 2017. 7, 8, 10, 12, 16

work page 2017
[79]

Gradient-based learning applied to document recognition

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324,

work page
[80]

Otter: A multi-modal model with in-context instruction tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 3, 11

work page arXiv 2023

Showing first 80 references.