GIT: A Generative Image-to-text Transformer for Vision and Language
Pith reviewed 2026-05-16 20:51 UTC · model grok-4.3
The pith
A simplified generative image-to-text transformer unifies vision-language tasks and sets new state-of-the-art results on 12 benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GIT establishes that a single image encoder and a single text decoder, trained with one language-modeling objective and scaled in pre-training data and model size, unify vision-language tasks and achieve new state-of-the-art results on 12 benchmarks, surpassing human performance on TextCaps (138.2 vs. 125.5 CIDEr).
What carries the argument
A generative image-to-text transformer consisting of one image encoder and one text decoder, trained end-to-end with a single language-modeling task.
Load-bearing premise
That increasing pre-training data volume and model size with a single language-modeling objective on a simple encoder-decoder architecture is enough to surpass prior specialized methods on vision-language tasks.
What would settle it
A model that retains complex multi-modal encoders plus external detectors and OCR, yet is matched to GIT in pre-training data volume and parameter count, still falls short of GIT's scores on the 12 reported benchmarks.
read the original abstract
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at https://github.com/microsoft/GenerativeImage2Text.
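To make the simplification concrete, the following is a minimal sketch of the one-encoder-one-decoder design in PyTorch. Everything here is an illustrative assumption: the module sizes, the stand-in linear patch embedding (the paper feeds features from a contrastively pre-trained vision transformer), and the name GITSketch are ours, not the released implementation.

import torch
import torch.nn as nn

# Minimal sketch of a GIT-style model: one image encoder, one text decoder,
# one causal language-modeling loss. All sizes and names are illustrative.
class GITSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2,
                 n_patches=49, max_text_len=32, patch_dim=768):
        super().__init__()
        # Stand-in projection of pre-extracted image features; the paper
        # uses a full pre-trained vision transformer as the image encoder.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(n_patches + max_text_len, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.n_patches = n_patches

    def forward(self, patch_feats, text_ids):
        # patch_feats: (B, n_patches, patch_dim); text_ids: (B, T)
        T = text_ids.size(1)
        x = torch.cat([self.patch_embed(patch_feats),
                       self.token_embed(text_ids)], dim=1)
        x = x + self.pos_embed(torch.arange(x.size(1), device=x.device))
        # seq2seq attention mask: every position sees all image tokens;
        # text positions additionally see earlier text positions only.
        L, P = x.size(1), self.n_patches
        mask = torch.full((L, L), float('-inf'), device=x.device)
        mask[:, :P] = 0.0
        mask[P:, P:] = torch.triu(torch.full((T, T), float('-inf'),
                                             device=x.device), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.lm_head(h[:, P:])  # logits at text positions only

# The single LM objective: predict token t+1 from the image and tokens <= t.
model = GITSketch()
patches = torch.randn(2, 49, 768)
caps = torch.randint(0, 30522, (2, 32))
logits = model(patches, caps)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)), caps[:, 1:].reshape(-1))
loss.backward()

The decoder is built from nn.TransformerEncoder layers only because the conditioning lives entirely in the attention mask over a flat image-plus-text token stream; there is no separate cross-attention module, which is the simplification the abstract describes.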
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GIT, a generative image-to-text transformer consisting of a single image encoder and text decoder trained end-to-end under a unified language modeling objective. By scaling pre-training data volume and model size, the authors claim new state-of-the-art results on 12 vision-language benchmarks (including image/video captioning and VQA), with the model surpassing human performance on TextCaps (138.2 vs. 125.5 CIDEr) for the first time; they also present generation-based schemes for image classification and scene text recognition.
Significance. If the empirical claims hold after verification, the work would be significant for demonstrating that a minimal encoder-decoder architecture under a single LM objective can unify tasks and exceed prior complex models that rely on external modules (detectors, OCR). This would underscore the value of scale over architectural elaboration and simplify the design space for vision-language models. Code release supports reproducibility.
major comments (2)
- [Experiments] Experimental sections: the manuscript reports strong benchmark numbers and SOTA claims but provides no ablations or matched-scale re-runs that hold pre-training data volume and parameter count fixed while restoring external modules or multi-encoder designs from prior work. This is load-bearing for the central unification claim, as it leaves open whether gains derive primarily from scaling rather than simplification.
- [TextCaps Evaluation] TextCaps results (Table X, CIDEr row): the 138.2 score surpassing human performance is presented without error analysis or ablation on how the image encoder captures scene text in the absence of explicit OCR; this weakens confidence that the architecture generalizes beyond the specific training distribution.
minor comments (2)
- [Method] Notation for the image encoder and text decoder could be clarified with explicit equations showing the joint LM loss formulation; one possible rendering is sketched after this list.
- [Figures] Figure captions for benchmark comparisons should include the exact prior methods and their scales for direct visual comparison.
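On the first minor comment, a hedged rendering of the formulation the referee asks for: under the paper's setup this is the standard conditional language-modeling loss, written here in our own notation (the manuscript's exact symbols, and any label smoothing, may differ):

\mathcal{L} = \frac{1}{N+1} \sum_{i=1}^{N+1} \mathrm{CE}\bigl( y_i,\; p_\theta(y_i \mid I, y_0, \ldots, y_{i-1}) \bigr)

where I is the image, y_1, ..., y_N are the text tokens, y_0 is a begin-of-sentence token, y_{N+1} is an end-of-sentence token, and CE is cross-entropy; the image encoder and text decoder are trained jointly under this single objective.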
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of our simplified architecture. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Experiments] Experimental sections: the manuscript reports strong benchmark numbers and SOTA claims but provides no ablations or matched-scale re-runs that hold pre-training data volume and parameter count fixed while restoring external modules or multi-encoder designs from prior work. This is load-bearing for the central unification claim, as it leaves open whether gains derive primarily from scaling rather than simplification.
Authors: We agree that matched-scale ablations re-implementing prior complex models (with external modules or multi-encoder designs) at identical data volume and parameter count would provide stronger isolation of the simplification benefit. Such experiments are computationally prohibitive at our scale. Our evidence instead rests on consistent outperformance of reported prior SOTA results that used more elaborate designs, combined with our internal scaling curves showing gains from model size and data. We will revise the experimental discussion to explicitly note this limitation and emphasize that the unification claim is supported by the single-architecture results rather than direct head-to-head re-runs. revision: partial
-
Referee: [TextCaps Evaluation] TextCaps results (Table X, CIDEr row): the 138.2 score surpassing human performance is presented without error analysis or ablation on how the image encoder captures scene text in the absence of explicit OCR; this weakens confidence that the architecture generalizes beyond the specific training distribution.
Authors: We acknowledge the value of error analysis for the TextCaps result. The image encoder is a Vision Transformer pre-trained on large-scale image-text pairs that naturally contain scene text, allowing implicit learning of text recognition within the generative objective. We will add a dedicated error analysis subsection with qualitative examples of generated captions on text-rich images, attention visualizations highlighting text regions, and a breakdown of failure cases. A controlled ablation removing all text from pre-training data is not feasible within our compute budget but would be a useful direction for future work; the current results on TextCaps already demonstrate generalization to scene-text understanding. revision: partial
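To ground the rebuttal's claim that scene text is read generatively rather than through an OCR module, here is a hedged sketch of the inference loop such a model would run. Greedy decoding, the bos_id/eos_id values, and the model interface (any callable like the GITSketch above) are our assumptions, not the paper's released code.

import torch

@torch.no_grad()
def greedy_decode(model, patch_feats, bos_id=101, eos_id=102, max_len=32):
    # Generate one token at a time: each step re-runs the decoder on the
    # image tokens plus the text prefix and takes the argmax continuation.
    ids = torch.full((patch_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len - 1):
        logits = model(patch_feats, ids)          # (B, T, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if (next_id == eos_id).all():             # stop once every sample ends
            break
    return ids

Generation-based scene text recognition or image classification then reduces to decoding and string-matching the output against the label vocabulary, with no detector or OCR head anywhere in the pipeline.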
Circularity Check
No circularity: empirical scaling results on public benchmarks
full rationale
The paper describes an empirical pipeline: a simplified single-encoder single-decoder transformer is pre-trained with a language-modeling objective on a large corpus and then fine-tuned/evaluated on standard vision-language benchmarks. No equations, uniqueness theorems, or fitted parameters are presented as 'predictions' that reduce to the inputs by construction. Performance numbers (e.g., TextCaps CIDEr) are reported outcomes of training and testing, not self-definitions or renamings of prior results. Self-citations, if present, are not load-bearing for the central claim; the unification argument rests on the observed benchmark margins rather than any circular reduction. This is a standard scaling experiment whose validity can be checked externally by re-training or re-evaluation.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale
- pre-training data volume
axioms (1)
- domain assumption: A single language-modeling objective on paired image-text data is sufficient to learn unified vision-language representations.
Forward citations
Cited by 18 Pith papers
-
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Language Is Not All You Need: Aligning Perception with Language Models
Kosmos-1 shows strong zero-shot and few-shot results on language tasks, image captioning, visual QA, OCR-free document understanding, and image recognition guided by text instructions.
-
PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
-
Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models
Embedding Arithmetic performs vector operations in the embedding space of T2I models to mitigate bias at inference time, outperforming baselines on diversity while preserving coherence via a new Concept Coherence Score.
-
ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
-
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding ...
-
CogVLM: Visual Expert for Pretrained Language Models
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Sigmoid Loss for Language Image Pre-Training
SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
Text-Guided Multi-Scale Frequency Representation Adaptation
FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
-
[2]
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...
-
[5]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
-
[6]
Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation
Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, and Rita Cucchiara. Universal captioner: long-tail vision-and-language model training through content-style separation. arXiv preprint arXiv:2111.12727, 2021.
-
[7]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-
[8]
An empirical study of training end-to-end vision-and-language transformers
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng. An empirical study of training end-to-end vision-and-language transformers. arXiv preprint arXiv:2111.02387, 2021.
-
[9]
Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, and Yongdong Zhang. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition. In CVPR, 2021a. Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lin Liang, Zhe Gan, Lijuan Wang, Yezhou Yang, and Zicheng Liu. Injecting semantic concepts into end-to-end image captionin...
-
[10]
Structured Multimodal Attentions for TextVQA
Chenyu Gao, Qi Zhu, Peng Wang, Hui Li, Yuliang Liu, Anton van den Hengel, and Qi Wu. Structured multimodal attentions for TextVQA. arXiv preprint arXiv:2006.00753, 2020.
-
[11]
Captioning Images Taken by People Who Are Blind
Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. arXiv preprint arXiv:2002.08565, 2020.
-
[12]
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. arXiv preprint arXiv:2111.12233, 2021a. Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. VIVO: surpassing human performance in novel object captioning with visual voca...
-
[13]
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020.
-
[14]
Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
-
[15]
ICDAR 2013 Robust Reading Competition
Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013.
-
[16]
ICDAR 2015 Competition on Robust Reading
Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. ICDAR 2015 competition on robust reading. In ICDAR, 2015.
-
[17]
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
-
[18]
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mPLUG: effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022a. Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align b...
-
[19]
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. SwinBERT: end-to-end transformers with sparse attention for video captioning. arXiv preprint arXiv:2111.13196, 2021a. Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollá...
-
[20]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
-
[21]
UniVL: A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation
Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: a unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
-
[22]
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, and Jingdong Wang. MaskOCR: text recognition with masked encoder-decoder pretraining. arXiv preprint arXiv:2206.00311, 2022.
-
[23]
Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model
Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at TextVQA Challenge 2021: vision-and-language representation learning with pre-trained sequence-to-sequence model. arXiv preprint arXiv:2106.15332, 2021.
-
[24]
End-to-End Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning. arXiv preprint arXiv:2201.08264, 2022.
-
[25]
How Much Can CLIP Benefit Vision-and-Language Tasks?
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can CLIP benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021.
-
[26]
CLIP4Caption++: Multi-CLIP for Video Caption
Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Fengyun Rao, and Dian Li. CLIP4Caption++: multi-CLIP for video caption. arXiv preprint arXiv:2110.05204, 2021.
-
[27]
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
-
[28]
All in one: Exploring unified video-language pre-training
Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303, 2022a. Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. Controllable video captioning with pos sequence guida...
-
[29]
UFO: A Unified Transformer for Vision-Language Representation Learning
Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, and Lijuan Wang. UFO: a unified transformer for vision-language representation learning. arXiv preprint arXiv:2111.10023, 2021a. Kai Wang, Boris Babenko, and Serge Belongie. End-to-end scene text recognition. In ICCV, 2011.
-
[30]
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv preprint arXiv:2202.03052, 2022b. Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A lar...
-
[31]
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, and Michael Zeng. Visual clues: bridging vision and language foundations for image paragraph captioning. arXiv preprint arXiv:2206.01843, 2022.
-
[32]
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, and Jiebo Luo. Probing inter-modality: visual parsing with self-attention for vision-language pre-training. arXiv preprint arXiv:2106.13488, 2021a.
-
[33]
CoCa: Contrastive Captioners are Image-Text Foundation Models
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
-
[34]
Florence: A New Foundation Model for Computer Vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: a new foundation model for computer vision. arXiv prepri...
-
[35]
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
-
[36]
VATEX Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning
Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, and Jing Liu. VATEX video captioning challenge 2020: multi-view features and hybrid reward strategies for video captioning. arXiv preprint arXiv:1910.11102, 2019.