arxiv: 2305.18565 · v1 · pith:W4TPLI53new · submitted 2023-05-29 · 💻 cs.CV · cs.CL· cs.LG

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Xi Chen , Josip Djolonga , Piotr Padlewski , Basil Mustafa , Soravit Changpinyo , Jialin Wu , Carlos Riquelme Ruiz , Sebastian Goodman

show 35 more authors

Xiao Wang Yi Tay Siamak Shakeri Mostafa Dehghani Daniel Salz Mario Lucic Michael Tschannen Arsha Nagrani Hexiang Hu Mandar Joshi Bo Pang Ceslee Montgomery Paulina Pietrzyk Marvin Ritter AJ Piergiovanni Matthias Minderer Filip Pavetic Austin Waters Gang Li Ibrahim Alabdulmohsin Lucas Beyer Julien Amelot Kenton Lee Andreas Peter Steiner Yang Li Daniel Keysers Anurag Arnab Yuanzhong Xu Keran Rong Alexander Kolesnikov Mojtaba Seyedhosseini Anelia Angelova Xiaohua Zhai Neil Houlsby Radu Soricut

This is my paper

Pith reviewed 2026-05-17 14:29 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG

keywords modelmultilingualpali-xtaskstrainingcaptioningcomplexdetection

0 comments

The pith

Scaling up PaLI-X sets new state-of-the-art on most vision and language benchmarks and shows emergent capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaLI-X is scaled up in model size and training task variety to handle vision and language tasks multilingually. This produces top results on image captioning, question answering, document understanding, object detection and video tasks. The model surpasses previous bests on over 25 benchmarks and displays new skills such as complex counting and multilingual object detection that were not part of the direct training.

Core claim

PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them) by scaling up the size of the components and the breadth of its training task mixture. It achieves new levels of performance on multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot learning, as well as object detection, video question answering, and video captioning. Finally, emerging capabilities are observed, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

What carries the argument

Scaling the PaLI-X multilingual vision and language model in size and training task mixture to improve performance and reveal new abilities.

Load-bearing premise

That larger model size combined with a broader training task mixture will produce higher benchmark scores and emergent behaviors without needing task-specific fine-tuning or architectural modifications.

What would settle it

Training an even larger PaLI-X on the expanded task set but finding no improvement over prior SOTA on most benchmarks or absence of complex counting ability.

read the original abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PaLI-X scales the prior PaLI model in size and task mix to post new numbers on 25+ vision-language benchmarks plus some emergent behaviors, but the attribution to scale alone depends on the exact evaluation protocols used.

read the letter

The core takeaway is that this paper trains a larger multilingual vision-language model with a wider task mixture and reports concrete gains across captioning, VQA, document understanding, video tasks, and few-shot learning, along with some behaviors like complex counting and multilingual detection that were not directly trained for. The results are presented as empirical outcomes from the scaled setup rather than any new theoretical derivation.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PaLI-X, a scaled-up multilingual vision-and-language model. The authors increase both model size and the breadth of the training task mixture, reporting new state-of-the-art results on more than 25 vision-and-language benchmarks that include image captioning, visual question answering, document understanding, few-shot in-context learning, object detection, video QA, and video captioning. They additionally document emergent capabilities such as complex counting and multilingual object detection that were not explicitly present in the training mix.

Significance. If the benchmark gains and emergent behaviors are shown to arise under evaluation protocols that are at least as stringent as those used by prior work, the paper would provide concrete evidence that joint scaling of model capacity and task diversity yields both higher performance and qualitatively new abilities in multimodal models. The breadth of tasks covered and the observation of capabilities outside the explicit training distribution are strengths that could inform the design of future generalist vision-language systems.

major comments (2)

[Abstract] Abstract and results presentation: the headline claim of advancing the state of the art on 25+ benchmarks does not specify, for each benchmark or benchmark group, whether the reported numbers are zero-shot, few-shot, or obtained after task-specific fine-tuning. Because the training mixture already contains captioning, VQA, and document-understanding objectives, it is impossible to isolate the contribution of scale and task breadth from ordinary per-task supervised adaptation that earlier models also receive.
[Emergent capabilities discussion] Section describing emergent capabilities: the observations of complex counting and multilingual object detection are presented as qualitative evidence of emergence, yet the manuscript provides neither quantitative baselines from smaller PaLI variants nor explicit prompting protocols. Without these controls it is difficult to confirm that the behaviors are truly emergent rather than the result of the broader task mixture or prompting choices.

minor comments (2)

[Results tables] Table captions and benchmark lists should explicitly note the evaluation protocol (zero-shot / k-shot / fine-tuned) alongside each score so that readers can immediately compare with prior work.
[Training recipe] The description of the training task mixture would benefit from a concise table enumerating the proportion of each task type and the total number of examples seen during pre-training.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we will revise the manuscript for greater clarity while maintaining the integrity of our experimental claims.

read point-by-point responses

Referee: [Abstract] Abstract and results presentation: the headline claim of advancing the state of the art on 25+ benchmarks does not specify, for each benchmark or benchmark group, whether the reported numbers are zero-shot, few-shot, or obtained after task-specific fine-tuning. Because the training mixture already contains captioning, VQA, and document-understanding objectives, it is impossible to isolate the contribution of scale and task breadth from ordinary per-task supervised adaptation that earlier models also receive.

Authors: We agree that the abstract would benefit from greater specificity on evaluation protocols. In the revised manuscript we will update the abstract to state that results are reported under the standard protocols used by prior work on each benchmark (zero-shot or few-shot for some tasks, task-specific fine-tuning for others), with full per-benchmark details retained in the experimental sections and tables. Regarding isolation of scale versus task adaptation, we note that the paper's central contribution is the joint effect of increased model capacity and broader task mixture; direct comparisons to prior models trained on narrower mixtures are already provided. While a perfectly controlled factorial ablation isolating every variable is indeed difficult, the observed scaling trends and outperformance relative to smaller PaLI variants support the value of the combined approach. We will add a short clarifying paragraph in the introduction. revision: partial
Referee: [Emergent capabilities discussion] Section describing emergent capabilities: the observations of complex counting and multilingual object detection are presented as qualitative evidence of emergence, yet the manuscript provides neither quantitative baselines from smaller PaLI variants nor explicit prompting protocols. Without these controls it is difficult to confirm that the behaviors are truly emergent rather than the result of the broader task mixture or prompting choices.

Authors: We thank the referee for this observation. To strengthen the emergence claims we will incorporate quantitative results from smaller PaLI variants on the same counting and detection prompts, along with the exact prompting templates used. These comparisons are available from our internal scaling experiments and will be added to the revised section, allowing readers to assess whether the behaviors appear only at the largest scale. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical scaling results are self-contained observations

full rationale

The paper reports training a scaled multilingual vision-language model and its benchmark outcomes without any mathematical derivation chain, equations, or predictions that reduce to fitted inputs by construction. Claims rest on direct empirical results from model size increases and task mixture expansion, evaluated on external benchmarks; no self-definitional steps, self-citation load-bearing arguments, or renamed known results appear in the presented material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of scaling model size and task breadth; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5614 in / 1029 out tokens · 11508 ms · 2026-05-17T14:29:04.937460+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
RT-H: Action Hierarchies Using Language
cs.RO 2024-03 conditional novelty 7.0

RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability...
Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
cs.CV 2026-04 unverdicted novelty 6.0

MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
cs.RO 2025-05 unverdicted novelty 6.0

GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
cs.RO 2024-11 unverdicted novelty 6.0

CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Vision-Language Foundation Models as Effective Robot Imitators
cs.RO 2023-11 conditional novelty 6.0

RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
Aligning Large Multimodal Models with Factually Augmented RLHF
cs.CV 2023-09 conditional novelty 6.0

Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
GR-3 Technical Report
cs.RO 2025-07 unverdicted novelty 5.0

GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
cs.RO 2025-07 unverdicted novelty 5.0

The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
cs.CV 2026-04 unverdicted novelty 4.0

MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
PaliGemma: A versatile 3B VLM for transfer
cs.CV 2024-07 unverdicted novelty 4.0

PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
Improved Baselines with Visual Instruction Tuning
cs.CV 2023-10 conditional novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 18 Pith papers · 4 internal anchors

[1]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Tom B. Brown, Benjamin Mann, Nick Ryder, Jared Kaplan Melanie Subbiah, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Christopher Hesse Clemens Winter, Mark Chen, Eric Sigler, Mateusz Litwin,...

work page 2020
[3]

GLaM: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...

work page 2022
[4]

Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Z...

work page 2023
[5]

PaLI: A jointly-scaled multilingual language-image model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebas- tian Alexander Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karago...

work page 2023
[6]

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Je- natton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin ...

work page 2023
[7]

Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler

Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. UL2: Unifying language learning paradigms. In ICLR, 2023

work page 2023
[8]

In: Findings of the 61st Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/ 2212.10505 10

Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022

work page arXiv 2022
[9]

GIT: A generative image-to-text transformer for vision and language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. TMLR, 2022

work page 2022
[10]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

work page 2022
[11]

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[13]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[14]

A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

work page 2022
[16]

Least-to-most prompting enables complex reasoning in large language models

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. In ICLR, 2023

work page 2023
[17]

Larger language models do in-context learning differently

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023

work page arXiv 2023
[18]

Meta-learning via language model in-context tuning

Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. In ACL, 2022

work page 2022
[19]

Unified-IO: A unified model for vision, language, and multi-modal tasks

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. In ICLR, 2023

work page 2023
[20]

Spotlight: Mobile UI understanding using vision-language models with a focus

Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. In ICLR, 2023. 25

work page 2023
[21]

SimVLM: Simple visual language model pretraining with weak supervision

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022

work page 2022
[22]

Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018

work page 2018
[23]

Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut

Ashish V . Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022

work page 2022
[24]

PreSTU: Pre-training for scene-text understanding

Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. arXiv preprint arXiv:2209.05534, 2022

work page arXiv 2022
[25]

All you may need for VQA are image captions

Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. All you may need for VQA are image captions. In NAACL, 2022

work page 2022
[26]

Pre-training image-language trans- formers for open-vocabulary tasks

AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Pre-training image-language trans- formers for open-vocabulary tasks. In T4V: Transformers for Vision Workshop, Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[27]

Pix2Struct: Screenshot parsing as pretraining for visual language understanding

Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In ICML, 2023

work page 2023
[28]

Simple open-vocabulary object detection with vision transformers

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection with vision transformers. In ECCV, 2022

work page 2022
[29]

Vector-quantized image modeling with improved VQGAN

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. In ICLR, 2022

work page 2022
[30]

Deep visual-semantic alignments for generating image descriptions

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015

work page 2015
[31]

nocaps: Novel object captioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019

work page 2019
[32]

TextCaps: a dataset for image captioning with reading comprehension

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: a dataset for image captioning with reading comprehension. In European conference on computer vision, pages 742–758, 2020

work page 2020
[33]

Captioning images taken by people who are blind

Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 417–434. Springer, 2020

work page 2020
[34]

Screen2Words: Automatic mobile ui summarization with multimodal learning

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2Words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, UIST ’21, page 498–510, New York, NY , USA, 2021. Association for Computing Machinery

work page 2021
[35]

Widget Captioning: Generating natural language description for mobile user interface elements

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget Captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5495–5510, Online, November 2020. Association for Computational Linguistics

work page 2020
[36]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017

work page 2017
[37]

OK-VQA: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

work page 2019
[38]

Tallyqa: Answering complex counting questions

Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8076–8084, 2019. 26

work page 2019
[39]

Towards VQA models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

work page 2019
[40]

VizWiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 3608–3617, 2018

work page 2018
[41]

Jawahar, Ernest Valveny, and Dimosthenis Karatzas

Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V . Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4290–4300, 2019

work page 2019
[42]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019

work page 2019
[43]

Infographicvqa

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawa- har. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

work page 2022
[44]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

work page 2021
[45]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016

work page 2016
[46]

ChartQA: A benchmark for question answering about charts with visual and logical reasoning

Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, 2022

work page 2022
[47]

Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities

Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. arXiv preprint arXiv:2302.11154, 2023

work page arXiv 2023
[48]

Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming- Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023

work page arXiv 2023
[49]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022

work page arXiv 2022
[50]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...

work page 2023
[51]

Revisiting modulated convolutions for visual counting and beyond

Duy-Kien Nguyen, Vedanuj Goswami, and Xinlei Chen. Revisiting modulated convolutions for visual counting and beyond. ICLR, 2021

work page 2021
[52]

Winner team Mia at TextVQA challenge 2021: Vision-and-language represen- tation learning with pre-trained sequence-to-sequence model

Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at TextVQA challenge 2021: Vision-and-language represen- tation learning with pre-trained sequence-to-sequence model. arXiv preprint arXiv:2106.15332, 2021

work page arXiv 2021
[53]

Manmatha

Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, and R. Manmatha. LaTr: Layout-aware transformer for scene-text VQA. In CVPR, 2022

work page 2022
[54]

Unifying vision, text, and layout for universal document processing

Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In CVPR, 2023

work page 2023
[55]

MSR-VTT: A large video description dataset for bridging video and language

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016. 27

work page 2016
[56]

V ATEX: A large-scale, high-quality multilingual dataset for video-and-language research

Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. V ATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019

work page 2019
[57]

Dense- captioning events in videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense- captioning events in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017

work page 2017
[58]

Spoken Moments: Learning joint audio-visual representations from video descriptions

Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken Moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021

work page 2021
[59]

NExT-QA: Next phase of question- answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question- answering to explaining temporal actions. In CVPR, 2021

work page 2021
[60]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In MM, 2017

work page 2017
[61]

ActivityNet-QA: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019

work page 2019
[62]

End-to-end dense video captioning with parallel decoding

Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6847–6857, 2021

work page 2021
[63]

VindLU: A recipe for effective video-and-language pretraining

Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, and Gedas Bertasius. VindLU: A recipe for effective video-and-language pretraining. In CVPR, 2023

work page 2023
[64]

End-to-end generative pretraining for multimodal video captioning

Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17959–17968, 2022

work page 2022
[65]

A multi-world approach to question answering about real- world scenes based on uncertain input

Mario Fritz Mateusz Malinowski. A multi-world approach to question answering about real- world scenes based on uncertain input. In NeurIPS, 2014

work page 2014
[66]

ImageNet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large- scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009

work page 2009
[67]

Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020

Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020

work page arXiv 2006
[68]

The many faces of robustness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

work page 2021
[69]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

work page 2021
[70]

Learning robust global represen- tations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global represen- tations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019

work page 2019
[71]

Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400, 2019

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400, 2019

work page 2019
[72]

Fleet, and Geoffrey E

Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey E. Hinton. Pix2Seq: A language modeling framework for object detection. In The Tenth International Conference on Learning Representations, ICLR, 2022

work page 2022
[73]

Girshick

Agrim Gupta, Piotr Dollár, and Ross B. Girshick. LVIS: A dataset for large vocabulary instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 , pages 5356–5364. Computer Vision Foundation / IEEE, 2019. 28

work page 2019
[74]

Open-vocabulary object detection via vision and language knowledge distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022

work page 2022
[75]

Regionclip: Region-based language-image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip: Region-based language-image pretraining. In CVPR, 2022

work page 2022
[76]

Women also snowboard: Overcoming bias in captioning models

Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018

work page 2018
[77]

Semantics derived automatically from language corpora contain human-like biases

Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 2017

work page 2017
[78]

Men also like shopping: Reducing gender bias amplification using corpus-level constraints

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017

work page 2017
[79]

Towards fairness in visual recognition: Effective strategies for bias mitigation

Zeyu Wang, Klint Qinami, Ioannis Christos Karakozis, Kyle Genova, Prem Nair, Kenji Hata, and Olga Russakovsky. Towards fairness in visual recognition: Effective strategies for bias mitigation. In CVPR, 2020

work page 2020
[80]

Gender shades: Intersectional accuracy disparities in commercial gender classification

Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, 2018

work page 2018

Showing first 80 references.