pith. machine review for the scientific record.

arxiv: 2111.11432 · v1 · submitted 2021-11-22 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Florence: A New Foundation Model for Computer Vision

Pith reviewed 2026-05-16 09:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords computer vision · foundation model · image-text pretraining · zero-shot transfer · multimodal representations · object detection · video action recognition · visual question answering

The pith

Florence expands vision models from coarse scene representations to fine objects, videos, and extra modalities like depth using web-scale image-text data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Florence as a computer vision foundation model trained on large-scale web image-text pairs. Its goal is to create representations that adapt to many tasks with little extra work, covering classification, retrieval, detection, visual question answering, captioning, video retrieval, and action recognition. The model widens its scope from static images to dynamic videos and from basic RGB to added signals such as depth and captions. Florence reports new state-of-the-art numbers on most of 44 benchmarks, including 83.74 percent top-1 zero-shot accuracy on ImageNet-1K, 62.4 mAP on COCO detection after fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

Core claim

Florence is a foundation model that learns universal visual-language representations from Web-scale image-text data, enabling easy adaptation to diverse computer vision tasks ranging from image classification and object detection to video action recognition and visual question answering, while achieving new state-of-the-art performance on the majority of 44 representative benchmarks.

What carries the argument

Florence, the model that builds shared image-text representations and then extends them from coarse scenes to fine objects, from static frames to video sequences, and from RGB to additional signals such as depth and captions.

If this is right

  • Supports zero-shot transfer to novel images and objects without task-specific training.
  • Delivers 62.4 mAP on COCO object detection after standard fine-tuning.
  • Reaches 80.36 accuracy on visual question answering and 87.8 on Kinetics-600 action recognition.
  • Works across fully supervised fine-tuning, linear probing, few-shot, and zero-shot settings.
  • Handles both static image tasks and dynamic video tasks within the same base model.
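
The zero-shot bullet above can be illustrated with a minimal sketch, assuming a shared image-text embedding space in which classification reduces to picking the class prompt closest to the image by cosine similarity. The function names, prompts, and embedding values below are hypothetical and hand-written for illustration; they are not taken from the paper or from any Florence code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best_label, scores) using cosine similarity in the shared space."""
    img = l2_normalize(image_embedding)
    scores = {}
    for label, text_emb in class_text_embeddings.items():
        txt = l2_normalize(text_emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumption: a real model's text encoder would produce these).
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Toy image embedding (assumption: a real model's image encoder would produce this).
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

No classifier weights are trained here: adding a new class only requires embedding a new text prompt, which is what makes the setting "zero-shot".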

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • One model could eventually replace separate systems now used for images, videos, and depth sensing.
  • Adding still more signals such as audio or 3D geometry might further reduce the need for task-specific fine-tuning.
  • Real-world robotics or long-video monitoring would be a direct test of whether the generalization holds under continuous input.
  • If the pattern scales, training compute could shift from many narrow models to fewer broad ones.

Load-bearing premise

That training on diverse web-scale image-text data produces representations that generalize well with minimal customization across static images, videos, fine-grained objects, and additional modalities such as depth and captions.

What would settle it

A new benchmark set of fine-grained video or depth tasks where Florence requires heavy per-task retraining or falls below existing specialized models.

read the original abstract

Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
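
Of the transfer settings the abstract lists, linear probing is the simplest to sketch: the pretrained encoder stays frozen and only a linear classifier is trained on its output features. The example below is a minimal illustration under that assumption, using binary logistic regression with plain gradient descent. The feature vectors, labels, learning rate, and epoch count are hypothetical toy values, not settings from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights and bias of a logistic-regression probe on frozen features."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        # Only the probe's parameters are updated; the features never change.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Toy "frozen encoder" features (assumption: stand-ins for real embeddings).
features = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
            [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_probe(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

Because the encoder is untouched, probe accuracy is commonly read as a measure of how linearly separable the pretrained features already are.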

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Florence, a computer vision foundation model trained on web-scale image-text data. It expands visual representations from coarse to fine, static to dynamic, and RGB to multi-modal. The model is claimed to be adaptable to various tasks with minimal customization and achieves new state-of-the-art results on the majority of 44 representative benchmarks, including 83.74% top-1 and 97.18% top-5 accuracy on ImageNet-1K zero-shot classification, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
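
For readers unfamiliar with the top-1 and top-5 figures quoted in the summary, the sketch below shows how top-k accuracy is conventionally computed: a prediction counts as correct when the true class is among the k highest-scoring classes. The scores and labels are hypothetical toy values, and the function is an illustration rather than the paper's evaluation code.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    if len(score_rows) != len(true_labels):
        raise ValueError("score_rows and true_labels must have the same length")
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Class indices sorted by descending score; keep the first k.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Toy scores for 4 examples over 3 classes (hypothetical values).
scores = [
    [0.70, 0.20, 0.10],  # true class 0: top-1 hit
    [0.35, 0.45, 0.20],  # true class 0: misses top-1, inside top-2
    [0.10, 0.20, 0.70],  # true class 2: top-1 hit
    [0.50, 0.40, 0.10],  # true class 2: misses both top-1 and top-2
]
labels = [0, 0, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

Top-k accuracy can never decrease as k grows, which is why a top-5 figure is always at least as high as the corresponding top-1 figure.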

Significance. If the results are substantiated with full training and adaptation details, Florence would be a significant contribution as a versatile foundation model capable of handling diverse vision tasks across modalities with strong generalization from image-text pretraining.

major comments (2)
  1. Abstract: The abstract asserts training solely on Web-scale image-text data yet reports SOTA performance on video action recognition (Kinetics-600 at 87.8) and VQA (80.36) with 'minimal customization'. This claim is load-bearing for the foundation model narrative but lacks any description of the adaptation procedure for dynamic inputs or additional modalities, making it impossible to assess whether the performance stems from the pretraining or from task-specific engineering.
  2. Abstract: No training details, baselines, statistical tests, or ablation studies are provided to support the strong performance numbers across 44 benchmarks, which is load-bearing for verifying the central generalization claims.
minor comments (1)
  1. The manuscript should include a clear table or section summarizing all 44 benchmarks with direct comparisons to prior work and exact adaptation methods used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will make revisions to improve clarity where appropriate.

read point-by-point responses
  1. Referee: Abstract: The abstract asserts training solely on Web-scale image-text data yet reports SOTA performance on video action recognition (Kinetics-600 at 87.8) and VQA (80.36) with 'minimal customization'. This claim is load-bearing for the foundation model narrative but lacks any description of the adaptation procedure for dynamic inputs or additional modalities, making it impossible to assess whether the performance stems from the pretraining or from task-specific engineering.

    Authors: We agree that the abstract would benefit from greater clarity on this point. The model is pretrained solely on web-scale image-text pairs to obtain universal visual-language representations. For video action recognition, adaptation consists of sampling frames, applying the image encoder, and using lightweight temporal aggregation (e.g., mean pooling or a small 3D convolution head) without retraining the core model. For VQA, visual features are extracted and fused with question text via a minimal multimodal head. These procedures are described in the method and adaptation sections of the full manuscript. We will revise the abstract to briefly note the adaptation strategies for dynamic and multimodal inputs.

    revision: yes

  2. Referee: Abstract: No training details, baselines, statistical tests, or ablation studies are provided to support the strong performance numbers across 44 benchmarks, which is load-bearing for verifying the central generalization claims.

    Authors: The full manuscript includes pretraining details (data scale, architecture, optimization) in Section 3, direct baseline comparisons for all 44 benchmarks in the experimental tables, and ablation studies in Section 5 analyzing key components such as hierarchical representations. While formal statistical tests (e.g., p-values) are not reported for every benchmark, the consistent large-margin improvements across diverse tasks support the generalization claims. We will add a concise summary of training settings and highlight the ablation results more prominently, possibly in an expanded abstract or dedicated paragraph.

    revision: partial
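
The video adaptation outlined in the first simulated response (sample frames, apply the image encoder, aggregate by mean pooling) can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for an image encoder, frames are represented by plain numbers, and uniform segment-center sampling is one possible sampling rule, not one confirmed by the source.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick indices spread uniformly across a clip (the center of each segment)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]

def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one clip-level embedding."""
    dim = len(frame_embeddings[0])
    count = len(frame_embeddings)
    return [sum(e[d] for e in frame_embeddings) / count for d in range(dim)]

def encode_video(frames, image_encoder, num_samples=4):
    """Encode sampled frames independently, then aggregate over time."""
    indices = sample_frame_indices(len(frames), num_samples)
    embeddings = [image_encoder(frames[i]) for i in indices]
    return mean_pool(embeddings)

# Hypothetical stand-in encoder: maps a "frame" (a number here) to a 2-D vector.
def toy_encoder(frame):
    return [float(frame), float(frame) * 2.0]

frames = list(range(8))  # an 8-frame toy clip
indices = sample_frame_indices(len(frames), 4)
clip_embedding = encode_video(frames, toy_encoder, num_samples=4)
print(indices, clip_embedding)
```

Mean pooling discards frame order, so a design like this can only capture motion cues that survive averaging; that limitation is one reason the response also mentions a small 3D convolution head as an alternative.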

Circularity Check

0 steps flagged

No circularity: Florence reports direct empirical benchmark results from large-scale pretraining

full rationale

The paper describes training Florence on web-scale image-text data and evaluates it via standard held-out benchmarks (ImageNet-1K zero-shot, COCO mAP, VQA, Kinetics-600). No mathematical derivation, prediction step, or first-principles claim is present that reduces to its own inputs by construction. Performance numbers are measured outcomes, not fitted parameters renamed as predictions. Self-citations (if any) are not load-bearing for any uniqueness theorem or ansatz; the central claims rest on experimental transfer results rather than self-referential definitions. This is the expected non-finding for an empirical foundation-model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard scaling laws for foundation models and the assumption that web image-text pairs are sufficient for the claimed generalization.

pith-pipeline@v0.9.0 · 5679 in / 1161 out tokens · 27740 ms · 2026-05-16T09:34:47.258359+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

  2. WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

    cs.CV 2026-03 unverdicted novelty 7.0

    WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

  3. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  4. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  5. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    cs.CV 2023-01 unverdicted novelty 7.0

    BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...

  6. PaLI: A Jointly-Scaled Multilingual Language-Image Model

    cs.CV 2022-09 conditional novelty 7.0

    PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

  7. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  8. Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.

  9. Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.

  10. CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

    cs.RO 2026-01 unverdicted novelty 6.0

    CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

  11. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  12. Demystifying CLIP Data

    cs.CV 2023-09 accept novelty 6.0

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  13. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  14. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  15. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    cs.CV 2022-03 conditional novelty 6.0

    DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.

  16. From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

  17. Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook

    eess.SP 2026-04 accept novelty 5.0

    The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with...

  18. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  19. Split and Aggregation Learning for Foundation Models Over Mobile Embodied AI Network (MEAN): A Comprehensive Survey

    cs.IT 2026-05 unverdicted novelty 3.0

    The paper surveys split and aggregation learning for foundation models in 6G networks to improve efficiency, resource use, and data privacy in distributed AI.
