arxiv: 2111.11432 · v1 · submitted 2021-11-22 · 💻 cs.CV · cs.AI · cs.LG


Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang


Pith reviewed 2026-05-16 09:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.LG

keywords computer vision, foundation model, image-text pretraining, zero-shot transfer, multimodal representations, object detection, video action recognition, visual question answering


The pith

Florence expands vision models from coarse scene representations to fine objects, videos, and extra modalities like depth using web-scale image-text data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Florence as a computer vision foundation model trained on large-scale web image-text pairs. Its goal is to create representations that adapt to many tasks with little extra work, covering classification, retrieval, detection, visual question answering, captioning, video retrieval, and action recognition. The model widens its scope from static images to dynamic videos and from basic RGB to added signals such as depth and captions. Florence reports new state-of-the-art numbers on most of 44 benchmarks, including 83.74 percent top-1 zero-shot accuracy on ImageNet-1K, 62.4 mAP on COCO detection after fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

Core claim

Florence is a foundation model that learns universal visual-language representations from Web-scale image-text data, enabling easy adaptation to diverse computer vision tasks ranging from image classification and object detection to video action recognition and visual question answering, while achieving new state-of-the-art performance on the majority of 44 representative benchmarks.

What carries the argument

Florence, the model that builds shared image-text representations and then extends them from coarse scenes to fine objects, from static frames to video sequences, and from RGB to additional signals such as depth and captions.

If this is right

Supports zero-shot transfer to novel images and objects without task-specific training.
Delivers 62.4 mAP on COCO object detection after standard fine-tuning.
Reaches 80.36 accuracy on visual question answering and 87.8 on Kinetics-600 action recognition.
Works across fully supervised fine-tuning, linear probing, few-shot, and zero-shot settings.
Handles both static image tasks and dynamic video tasks within the same base model.
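The zero-shot transfer mentioned in the list above is usually done by comparing an image embedding against text embeddings of class prompts and picking the closest one. The following is a minimal sketch of that idea, not Florence's actual code: the 3-dimensional embeddings, the prompt strings, and the function names are all invented for illustration, and a real model would produce the vectors with its image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the prompt whose text embedding is closest to the image embedding."""
    scores = {
        prompt: cosine(image_embedding, text_embedding)
        for prompt, text_embedding in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumption: made-up values, not outputs of any real encoder).
class_text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(label)
```

No classifier weights are trained here; adding a new class only means adding a new prompt, which is why this setting counts as transfer without task-specific training.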

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

One model could eventually replace separate systems now used for images, videos, and depth sensing.
Adding still more signals such as audio or 3D geometry might further reduce the need for task-specific fine-tuning.
Real-world robotics or long-video monitoring would be a direct test of whether the generalization holds under continuous input.
If the pattern scales, training compute could shift from many narrow models to fewer broad ones.

Load-bearing premise

That training on diverse web-scale image-text data produces representations that generalize well with minimal customization across static images, videos, fine-grained objects, and additional modalities such as depth and captions.

What would settle it

A new benchmark set of fine-grained video or depth tasks where Florence requires heavy per-task retraining or falls below existing specialized models.

Original abstract

Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Florence shows image-text pretraining can stretch to video and depth with wide benchmark coverage, but the minimal-customization claim needs method details to hold up.


The main point on Florence is that it takes the CLIP-style image-text pretraining approach and applies it to a broader set of tasks, including video action recognition on Kinetics-600 and depth-related work, while reporting new highs across most of 44 benchmarks like 83.74 top-1 zero-shot on ImageNet-1K and 62.4 mAP on COCO fine-tuning. The model is positioned as adaptable with minimal changes for classification, retrieval, detection, VQA, captioning, and video tasks, which is the core engineering contribution here.

What stands out is the reported scope: one backbone handling static images to dynamic video and RGB to additional signals like depth and captions, all starting from web-scale image-text data. The numbers on zero-shot, few-shot, linear probing, and full fine-tuning transfers look competitive on the surface.

The soft spots sit in the missing pieces. The abstract gives no training recipe, no ablations on how video transfer actually works, and no statistical checks, so it is hard to tell whether the 87.8 on Kinetics-600 comes from the pretraining itself or from later task-specific additions like temporal layers or auxiliary losses. The stress-test concern about unshown adaptation details for dynamic data lands because the paper claims training solely on image-text yet delivers strong video results; if the full methods show non-minimal components, the foundation-model story rests more on post-hoc work than the initial training.

This is aimed at groups building large-scale vision systems who want to see how far a single representation can stretch across modalities. A reader focused on multimodal or robotics applications would find the benchmark spread useful even if they have to dig for the exact adaptation steps. It deserves a serious referee because the empirical breadth is real and the results are strong enough to warrant checking the methods and reproducibility claims.
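For readers unsure what "CLIP-style image-text pretraining" refers to in the note above, here is a rough sketch of a symmetric contrastive loss over a batch of matched image-text pairs. It is an illustration of the general technique only: the abstract does not state Florence's exact objective, and the function names, the toy 2-dimensional embeddings, and the temperature of 0.07 are assumptions chosen for the example.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """image_embs[i] is paired with text_embs[i]; inputs are assumed L2-normalised."""
    n = len(image_embs)
    # Pairwise similarity logits: rows index images, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: each image should score its own text highest.
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch (assumption: hand-picked unit vectors, not real encoder outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each image is most similar to its own caption and large when the pairing is scrambled, which is the signal that pulls matched image and text embeddings together during pretraining.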

Referee Report

2 major / 1 minor

Summary. The paper introduces Florence, a computer vision foundation model trained on web-scale image-text data. It expands visual representations from coarse to fine, static to dynamic, and RGB to multi-modal. The model is claimed to be adaptable to various tasks with minimal customization and achieves new state-of-the-art results on the majority of 44 representative benchmarks, including 83.74% top-1 and 97.18% top-5 accuracy on ImageNet-1K zero-shot classification, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
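As a reminder of what the top-1 and top-5 figures in the summary measure, the sketch below computes top-k accuracy from per-class scores. The score rows and labels are toy values invented for this example (with k=2 standing in for k=5 because there are only three classes); nothing here comes from the paper's evaluation code.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true class is among the k highest-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Toy scores over three classes (assumption: made-up numbers).
score_rows = [
    [0.7, 0.2, 0.1],  # true class 0: correct at top-1
    [0.3, 0.5, 0.2],  # true class 0: missed at top-1, recovered at top-2
    [0.1, 0.3, 0.6],  # true class 0: missed at both
    [0.2, 0.1, 0.7],  # true class 2: correct at top-1
]
true_labels = [0, 0, 0, 2]

top1 = top_k_accuracy(score_rows, true_labels, k=1)
top2 = top_k_accuracy(score_rows, true_labels, k=2)
print(top1, top2)
```

Top-k accuracy can only stay equal or rise as k grows, which is why a reported top-5 number is always at least as high as the top-1 number for the same model.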

Significance. If the results are substantiated with full training and adaptation details, Florence would be a significant contribution as a versatile foundation model capable of handling diverse vision tasks across modalities with strong generalization from image-text pretraining.

major comments (2)

Abstract: The abstract asserts training solely on Web-scale image-text data yet reports SOTA performance on video action recognition (Kinetics-600 at 87.8) and VQA (80.36) with 'minimal customization'. This claim is load-bearing for the foundation model narrative but lacks any description of the adaptation procedure for dynamic inputs or additional modalities, making it impossible to assess whether the performance stems from the pretraining or from task-specific engineering.
Abstract: No training details, baselines, statistical tests, or ablation studies are provided to support the strong performance numbers across 44 benchmarks, which is load-bearing for verifying the central generalization claims.
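One simple form the statistical checks requested above could take is a confidence interval on a reported accuracy. The sketch below uses the Wilson score interval for a binomial proportion. It is an illustration, not an analysis from the paper: the evaluation-set size of 50,000 assumes the standard ImageNet-1K validation split was used, and it only captures sampling noise in the test set, not variation across training runs.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion measured on n examples (z=1.96 for ~95%)."""
    denom = 1 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# 0.8374 is the reported ImageNet-1K zero-shot top-1 accuracy;
# n=50_000 is an assumption about the evaluation set size.
low, high = wilson_interval(0.8374, 50_000)
print(round(low, 4), round(high, 4))
```

With tens of thousands of test images the interval is narrow, so on large benchmarks the more pressing question is usually seed-to-seed and recipe-to-recipe variance, which this calculation does not address.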

minor comments (1)

The manuscript should include a clear table or section summarizing all 44 benchmarks with direct comparisons to prior work and exact adaptation methods used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will make revisions to improve clarity where appropriate.


Referee: Abstract: The abstract asserts training solely on Web-scale image-text data yet reports SOTA performance on video action recognition (Kinetics-600 at 87.8) and VQA (80.36) with 'minimal customization'. This claim is load-bearing for the foundation model narrative but lacks any description of the adaptation procedure for dynamic inputs or additional modalities, making it impossible to assess whether the performance stems from the pretraining or from task-specific engineering.

Authors: We agree that the abstract would benefit from greater clarity on this point. The model is pretrained solely on web-scale image-text pairs to obtain universal visual-language representations. For video action recognition, adaptation consists of sampling frames, applying the image encoder, and using lightweight temporal aggregation (e.g., mean pooling or a small 3D convolution head) without retraining the core model. For VQA, visual features are extracted and fused with question text via a minimal multimodal head. These procedures are described in the method and adaptation sections of the full manuscript. We will revise the abstract to briefly note the adaptation strategies for dynamic and multimodal inputs.

revision: yes

Referee: Abstract: No training details, baselines, statistical tests, or ablation studies are provided to support the strong performance numbers across 44 benchmarks, which is load-bearing for verifying the central generalization claims.

Authors: The full manuscript includes pretraining details (data scale, architecture, optimization) in Section 3, direct baseline comparisons for all 44 benchmarks in the experimental tables, and ablation studies in Section 5 analyzing key components such as hierarchical representations. While formal statistical tests (e.g., p-values) are not reported for every benchmark, the consistent large-margin improvements across diverse tasks support the generalization claims. We will add a concise summary of training settings and highlight the ablation results more prominently, possibly in an expanded abstract or dedicated paragraph.

revision: partial
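The video adaptation described in the first simulated response (sample frames, encode each with the image encoder, then mean-pool) can be sketched as follows. This follows the simulated rebuttal's wording only and should not be read as the confirmed Florence method. `encode_frame` is a hypothetical stand-in for a frozen image encoder, the sampling rule is one common choice among several, and the toy "video" is just a list of numbers.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick evenly spaced indices: the centre of each of num_samples equal segments."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single clip-level embedding."""
    n = len(frame_embeddings)
    return [sum(dim) / n for dim in zip(*frame_embeddings)]

def encode_frame(frame):
    """Hypothetical stand-in for a frozen image encoder (assumption, not a real API)."""
    return [float(frame), float(frame) * 2.0]

# Toy "video": frame i is represented by the number i (assumption for illustration).
video = list(range(32))
indices = sample_frame_indices(len(video), 4)
clip_embedding = mean_pool([encode_frame(video[i]) for i in indices])
print(indices, clip_embedding)
```

Mean pooling discards frame order, so any gain from modelling motion would have to come from an added temporal component, which is exactly the kind of detail the referee asks the authors to spell out.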

Circularity Check

0 steps flagged

No circularity: Florence reports direct empirical benchmark results from large-scale pretraining


The paper describes training Florence on web-scale image-text data and evaluates it via standard held-out benchmarks (ImageNet-1K zero-shot, COCO mAP, VQA, Kinetics-600). No mathematical derivation, prediction step, or first-principles claim is present that reduces to its own inputs by construction. Performance numbers are measured outcomes, not fitted parameters renamed as predictions. Self-citations (if any) are not load-bearing for any uniqueness theorem or ansatz; the central claims rest on experimental transfer results rather than self-referential definitions. This is the expected non-finding for an empirical foundation-model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard scaling laws for foundation models and the assumption that web image-text pairs are sufficient for the claimed generalization.



Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
cs.CV 2026-03 unverdicted novelty 7.0

Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
cs.CV 2026-03 unverdicted novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
cs.CV 2023-01 unverdicted novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

PaLI: A Jointly-Scaled Multilingual Language-Image Model
cs.CV 2022-09 conditional novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV 2022-04 unverdicted novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.

Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
cs.CV 2026-05 unverdicted novelty 6.0

IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
cs.RO 2026-01 unverdicted novelty 6.0

CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
cs.CV 2023-03 unverdicted novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

CoCa: Contrastive Captioners are Image-Text Foundation Models
cs.CV 2022-05 accept novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
cs.CV 2022-03 conditional novelty 6.0

DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
cs.CV 2026-04 unverdicted novelty 5.0

VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook
eess.SP 2026-04 accept novelty 5.0

The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with...

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

Split and Aggregation Learning for Foundation Models Over Mobile Embodied AI Network (MEAN): A Comprehensive Survey
cs.IT 2026-05 unverdicted novelty 3.0

The paper surveys split and aggregation learning for foundation models in 6G networks to improve efficiency, resource use, and data privacy in distributed AI.
