Recognition: 2 Lean theorem links
CoCa: Contrastive Captioners are Image-Text Foundation Models
Pith reviewed 2026-05-15 10:48 UTC · model grok-4.3
The pith
CoCa jointly optimizes a contrastive loss and a captioning loss in one encoder-decoder, producing image-text foundation models that set new state-of-the-art results on ImageNet and a range of multimodal tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoCa is an image-text encoder-decoder model pretrained end-to-end by computing a contrastive loss between unimodal image and text embeddings together with a captioning loss on multimodal decoder outputs; the decoder omits cross-attention in its first half to produce separate unimodal text representations before cascading the remaining layers that cross-attend to the image encoder.
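The split described in the claim can be made concrete. Below is a minimal NumPy sketch of the cascaded decoder's data flow only, not the paper's implementation: single-head attention without learned projections, causal masking, or layer norm, and the names (`attn`, `cascaded_decoder`, `n_unimodal`) are illustrative.

```python
import numpy as np

def attn(q, k, v):
    # Single-head scaled dot-product attention (no projections, no mask).
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def cascaded_decoder(img_tokens, txt_tokens, n_unimodal=2, n_multimodal=2):
    # First half: self-attention only, no cross-attention -> unimodal text states.
    h = txt_tokens
    for _ in range(n_unimodal):
        h = h + attn(h, h, h)
    cls_text = h[-1]  # last token stands in for the [CLS] embedding (contrastive branch)
    # Second half: each layer also cross-attends to the image encoder outputs.
    m = h
    for _ in range(n_multimodal):
        m = m + attn(m, m, m)                     # self-attention over text states
        m = m + attn(m, img_tokens, img_tokens)   # cross-attention to image tokens
    return cls_text, m  # (unimodal embedding, multimodal states for captioning loss)
```

Because the first half never touches `img_tokens`, the unimodal embedding fed to the contrastive loss is identical regardless of which image the text is paired with; only the second-half states depend on the image.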
What carries the argument
The cascaded decoder that produces unimodal text representations without cross-attention in its first half and multimodal image-text representations with cross-attention in its second half.
Load-bearing premise
Omitting cross-attention in the first half of the decoder layers cleanly separates unimodal text representations from multimodal ones without harming overall capacity or optimization stability.
What would settle it
An experiment that trains an otherwise identical model but keeps cross-attention in every decoder layer and measures whether zero-shot ImageNet top-1 accuracy or captioning metrics improve or degrade.
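Both variants in such an ablation would share the same contrastive objective: a symmetric InfoNCE between pooled image and text embeddings. A minimal NumPy sketch for reference (the temperature value and pooling details here are illustrative, not the paper's settings):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired (image, text) embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities
    idx = np.arange(len(logits))

    def xent(l):
        # Cross-entropy with the matching pair on the diagonal as the target.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average of image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly aligned pairs should score a lower loss than mismatched ones, which is the property the zero-shot ImageNet evaluation ultimately exploits.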
original abstract
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoCa, a minimalist image-text encoder-decoder foundation model pretrained jointly from scratch with a contrastive loss on unimodal image and text embeddings and a captioning loss on multimodal decoder outputs. The key design omits cross-attention in the first half of decoder layers to produce unimodal text representations for the contrastive objective while cascading the remaining layers for multimodal image-text representations; all labels are treated as text for unified supervision. It reports state-of-the-art ImageNet results, including 86.3% zero-shot top-1, 90.6% with a frozen encoder and learned classification head, and 91.0% with a finetuned encoder, plus leading numbers on Kinetics, MSCOCO, VQA, and captioning tasks.
Significance. If the results hold, this work is significant for showing an efficient unification of contrastive and generative pretraining in a single computational graph with minimal overhead, yielding strong transfer across visual recognition, retrieval, and multimodal understanding benchmarks. The seamless treatment of all supervision as text and the reported consistent gains on multiple independent held-out test sets are notable strengths.
major comments (2)
- [Model Architecture / Decoder Design] The central architectural claim—that omitting cross-attention in the first half of decoder layers cleanly separates unimodal text representations (for contrastive loss) from multimodal ones (for captioning loss) with minimal overhead—lacks supporting ablation. No comparison is shown against a standard encoder-decoder transformer with cross-attention in every decoder layer or against applying the contrastive loss to multimodal embeddings. This choice is load-bearing for the unification claim and the reported ImageNet numbers (86.3% zero-shot, 91.0% finetuned).
- [Experiments and Results] The experimental results report strong absolute numbers (e.g., 91.0% ImageNet top-1 finetuned) without error bars, sensitivity analysis on the contrastive-vs-captioning loss weight (explicitly listed as a free parameter), or full hyperparameter tables. This weakens assessment of robustness and reproducibility of the gains.
minor comments (2)
- [Abstract] The abstract would benefit from a brief statement of model scale or training data volume to contextualize the results.
- [Methods] Notation for unimodal vs. multimodal embeddings should be defined once and used consistently to avoid minor ambiguity in the methods description.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and detailed review. We address each major comment below, indicating the revisions we will make to strengthen the manuscript.
point-by-point responses
-
Referee: [Model Architecture / Decoder Design] The central architectural claim—that omitting cross-attention in the first half of decoder layers cleanly separates unimodal text representations (for contrastive loss) from multimodal ones (for captioning loss) with minimal overhead—lacks supporting ablation. No comparison is shown against a standard encoder-decoder transformer with cross-attention in every decoder layer or against applying the contrastive loss to multimodal embeddings. This choice is load-bearing for the unification claim and the reported ImageNet numbers (86.3% zero-shot, 91.0% finetuned).
Authors: We thank the referee for highlighting this point. The motivation for the cascaded decoder design (unimodal text representations in early layers for contrastive loss, followed by multimodal layers for captioning) is described in Section 3.2, and the overall unification is supported by the end-to-end results across tasks. However, we agree that dedicated ablations would provide stronger evidence. In the revised manuscript we will add comparisons to a standard full cross-attention encoder-decoder baseline and to variants applying contrastive loss on multimodal embeddings; these will appear in Section 4.2 with accompanying efficiency and accuracy metrics. revision: yes
-
Referee: [Experiments and Results] The experimental results report strong absolute numbers (e.g., 91.0% ImageNet top-1 finetuned) without error bars, sensitivity analysis on the contrastive-vs-captioning loss weight (explicitly listed as a free parameter), or full hyperparameter tables. This weakens assessment of robustness and reproducibility of the gains.
Authors: We acknowledge the referee's concern about robustness and reproducibility. In the revised version we will report error bars for the primary ImageNet and downstream results (computed from multiple independent runs where compute permits), include a sensitivity sweep over the contrastive-to-captioning loss weight in the supplementary material, and provide complete hyperparameter tables in a new appendix section. revision: yes
Circularity Check
No circularity: empirical results on held-out benchmarks with no self-referential derivations
full rationale
The paper proposes an architectural variant (omitting cross-attention in early decoder layers) and reports direct empirical measurements such as 86.3% zero-shot ImageNet top-1 accuracy. These quantities are obtained by standard training and evaluation on external test sets; no equation or result is defined in terms of itself, no fitted parameter is renamed as a prediction, and no load-bearing premise reduces to a self-citation chain. The unification of contrastive and captioning objectives is achieved through explicit joint optimization rather than by construction, making the derivation chain self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- contrastive vs captioning loss weight
axioms (1)
- standard math: Standard multi-head attention and feed-forward transformer blocks behave as in the original Vaswani et al. formulation
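As a reading aid, the single free parameter above enters as the relative weighting in a sum of the two objectives; in the paper's notation (with $x$ the image and $y_{<t}$ the preceding text tokens):

```latex
\mathcal{L}_{\mathrm{CoCa}}
  = \lambda_{\mathrm{Con}} \cdot \mathcal{L}_{\mathrm{Con}}
  + \lambda_{\mathrm{Cap}} \cdot \mathcal{L}_{\mathrm{Cap}},
\qquad
\mathcal{L}_{\mathrm{Cap}}
  = -\sum_{t=1}^{T} \log P_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
```

Here $\mathcal{L}_{\mathrm{Con}}$ is the contrastive loss over the unimodal image and text embeddings, and the weights $\lambda_{\mathrm{Con}}$, $\lambda_{\mathrm{Cap}}$ are the hyperparameters the ledger flags as free.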
Forward citations
Cited by 22 Pith papers
-
OZ-TAL: Online Zero-Shot Temporal Action Localization
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
InstrAct: Towards Action-Centric Understanding in Instructional Videos
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
LRM: Large Reconstruction Model for Single Image to 3D
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
-
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero...
-
LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
-
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Capabilities of Gemini Models in Medicine
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
CogVLM: Visual Expert for Pretrained Language Models
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
Aligning Text-to-Image Models using Human Feedback
A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
-
[2]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
-
[3]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019
-
[4]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
-
[5]
Pathways: Asynchronous distributed dataflow for ml
Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al. Pathways: Asynchronous distributed dataflow for ml. arXiv preprint arXiv:2203.12533, 2022
-
[6]
Rich feature hierarchies for accurate object detection and semantic segmentation
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014
-
[7]
Fully convolutional networks for semantic segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015
-
[8]
Two-stream convolutional networks for action recognition in videos
Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems, 27, 2014
-
[9]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009
-
[10]
Coatnet: Marrying convolution and attention for all data sizes
Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34:3965–3977, 2021
-
[11]
Co-training transformer with videos and images improves action recognition
Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M Dai, Ruoming Pang, and Fei Sha. Co-training transformer with videos and images improves action recognition. arXiv preprint arXiv:2112.07175, 2021
-
[12]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021
-
[13]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021
-
[14]
Florence: A new foundation model for computer vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021
-
[15]
Show and tell: A neural image caption generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015
-
[16]
Simvlm: Simple visual language model pretraining with weak supervision
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021
-
[17]
Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022
-
[18]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012
-
[19]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
-
[20]
Exploring the limits of weakly supervised pretraining
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018
-
[21]
Scaling vision transformers, 2021
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 2021
-
[22]
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021
-
[23]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021
-
[24]
Simmim: A simple framework for masked image modeling
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021
-
[25]
Lxmert: Learning cross-modality encoder representations from transformers
Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019
-
[26]
Uniter: Universal image-text representation learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In ECCV, 2020
-
[27]
Vinvl: Revisiting visual representations in vision-language models
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588, June 2021
-
[28]
Faster r-cnn: Towards real-time object detection with region proposal networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015
-
[29]
Vilt: Vision-and-language transformer without convolution or region supervision, 2021
Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021
-
[30]
Vlmo: Unified vision-language pre-training with mixture-of-modality-experts
Wenhui Wang, Hangbo Bao, Li Dong, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021
-
[31]
Unified contrastive learning in image-text-label space, 2022
Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space, 2022
-
[32]
Lit: Zero-shot transfer with locked-image text tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. arXiv preprint arXiv:2111.07991, 2021
-
[33]
Combined scaling for open-vocabulary image classification, 2021
Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, and Quoc V. Le. Combined scaling for open-vocabulary image classification, 2021
-
[34]
Answer-me: Multi-task open-vocabulary visual question answering
AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and Anelia Angelova. Answer-me: Multi-task open-vocabulary visual question answering. arXiv preprint arXiv:2205.00949, 2022
-
[35]
Flava: A foundational language and vision alignment model
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482, 2021
-
[36]
Align before fuse: Vision and language representation learning with momentum distillation
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021
-
[37]
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022
-
[38]
The evolution of out-of-distribution robustness throughout fine-tuning
Anders Andreassen, Yasaman Bahri, Behnam Neyshabur, and Rebecca Roelofs. The evolution of out-of-distribution robustness throughout fine-tuning. arXiv preprint arXiv:2106.15831, 2021
-
[39]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
-
[40]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
-
[41]
A learning algorithm for continually running fully recurrent neural networks
Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989
-
[42]
Set transformer: A framework for attention-based permutation-invariant neural networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019
-
[43]
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015
-
[44]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959, 2018
-
[45]
Lingvo: a modular and scalable framework for sequence-to-sequence modeling, 2019
Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling, 2019
-
[46]
Gpipe: Efficient training of giant neural networks using pipeline parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019
-
[47]
Automatic cross-replica sharding of weight update in data-parallel training
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, and Shibo Wang. Automatic cross-replica sharding of weight update in data-parallel training. arXiv preprint arXiv:2004.13336, 2020
-
[48]
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020
-
[49]
Gspmd: general and scalable parallelization for ml computation graphs
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable parallelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021
-
[50]
Adafactor: Adaptive learning rates with sublinear memory cost
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018
-
[51]
Meta pseudo labels
Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V Le. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11557–11568, 2021
-
[52]
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. arXiv preprint arXiv:2203.05482, 2022
-
[53]
Vivit: A video vision transformer
Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021
-
[54]
Movinets: Mobile video networks for efficient video recognition
Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, and Boqing Gong. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16020–16030, 2021
-
[55]
Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34, 2021
-
[56]
Masked feature prediction for self-supervised visual pre-training
Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133, 2021
-
[57]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017
[58]
Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018
[59]
Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019
[60]
Mathew Monfort, Alex Andonian, Bolei Zhou, Kandan Ramakrishnan, Sarah Adel Bargal, Tom Yan, Lisa Brown, Quanfu Fan, Dan Gutfreund, Carl Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence, 42(2):502–508, 2019
[61]
Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021
[62]
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015
[63]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015
[64]
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021
[65]
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021
[66]
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019
[67]
Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019
[68]
Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019
[69]
Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, and Hugo Terashima-Marín. A straightforward framework for video retrieval using clip. arXiv preprint arXiv:2102.12443, 2021
[70]
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022
[71]
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016
[72]
Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020
[73]
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021
[74]
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Zicheng Liu, Michael Zeng, et al. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387, 2021
[75]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017
[76]
Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706, 2019
[77]
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018
[78]
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019
[79]
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017
[80]
Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233, 2021

A Visual Recognition Finetuning Details

Hyper-parameter ImageNet Kinetics-400/600/700 Moments-in-Time Frozen-feature Finetuning Frozen-feature Finetuning Froze...