pith. machine review for the scientific record.

arxiv: 2305.18565 · v1 · pith:W4TPLI53new · submitted 2023-05-29 · 💻 cs.CV · cs.CL · cs.LG

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Pith reviewed 2026-05-17 14:29 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.LG
keywords model multilingual pali-x tasks training captioning complex detection

The pith

Scaling up PaLI-X sets a new state of the art on most of the vision-and-language benchmarks considered and shows emergent capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PaLI-X is scaled up in both model size and training task variety to handle vision and language tasks multilingually. This produces top results on image captioning, question answering, document understanding, object detection and video tasks. The model surpasses previous bests on most of the benchmarks considered (25+ of them) and displays new skills, such as complex counting and multilingual object detection, that were not explicitly in the training mix.

Core claim

PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them) by scaling up the size of the components and the breadth of its training task mixture. It achieves new levels of performance on multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot learning, as well as object detection, video question answering, and video captioning. Finally, emerging capabilities are observed, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

What carries the argument

Scaling the PaLI-X multilingual vision and language model in size and training task mixture to improve performance and reveal new abilities.

Load-bearing premise

That larger model size combined with a broader training task mixture will produce higher benchmark scores and emergent behaviors without needing task-specific fine-tuning or architectural modifications.

What would settle it

Training an even larger PaLI-X on the expanded task set and finding either no improvement over the prior state of the art on most benchmarks or no complex counting ability.

Original abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PaLI-X, a scaled-up multilingual vision-and-language model. The authors increase both model size and the breadth of the training task mixture, reporting new state-of-the-art results on more than 25 vision-and-language benchmarks that include image captioning, visual question answering, document understanding, few-shot in-context learning, object detection, video QA, and video captioning. They additionally document emergent capabilities such as complex counting and multilingual object detection that were not explicitly present in the training mix.

Significance. If the benchmark gains and emergent behaviors are shown to arise under evaluation protocols that are at least as stringent as those used by prior work, the paper would provide concrete evidence that joint scaling of model capacity and task diversity yields both higher performance and qualitatively new abilities in multimodal models. The breadth of tasks covered and the observation of capabilities outside the explicit training distribution are strengths that could inform the design of future generalist vision-language systems.

major comments (2)
  1. [Abstract] Abstract and results presentation: the headline claim of advancing the state of the art on 25+ benchmarks does not specify, for each benchmark or benchmark group, whether the reported numbers are zero-shot, few-shot, or obtained after task-specific fine-tuning. Because the training mixture already contains captioning, VQA, and document-understanding objectives, it is impossible to isolate the contribution of scale and task breadth from ordinary per-task supervised adaptation that earlier models also receive.
  2. [Emergent capabilities discussion] Section describing emergent capabilities: the observations of complex counting and multilingual object detection are presented as qualitative evidence of emergence, yet the manuscript provides neither quantitative baselines from smaller PaLI variants nor explicit prompting protocols. Without these controls it is difficult to confirm that the behaviors are truly emergent rather than the result of the broader task mixture or prompting choices.
minor comments (2)
  1. [Results tables] Table captions and benchmark lists should explicitly note the evaluation protocol (zero-shot / k-shot / fine-tuned) alongside each score so that readers can immediately compare with prior work.
  2. [Training recipe] The description of the training task mixture would benefit from a concise table enumerating the proportion of each task type and the total number of examples seen during pre-training.
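
The protocol labeling requested in minor comment 1 can be sketched as a small record type. This is a hypothetical illustration, not the paper's evaluation code: the `Protocol`, `Score`, and `best_per_protocol` names, the benchmark and model names, and every value are placeholders invented for the example. The point is only that a score is compared against others reported under the same protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    ZERO_SHOT = "zero-shot"
    FEW_SHOT = "few-shot"
    FINE_TUNED = "fine-tuned"


@dataclass(frozen=True)
class Score:
    benchmark: str
    model: str
    protocol: Protocol
    value: float  # assumption for this sketch: higher is better


def best_per_protocol(scores):
    """Return the top-scoring entry for each (benchmark, protocol) pair."""
    best = {}
    for s in scores:
        key = (s.benchmark, s.protocol)
        if key not in best or s.value > best[key].value:
            best[key] = s
    return best


# Placeholder entries only; these are NOT numbers from the paper.
scores = [
    Score("benchmark_a", "model_small", Protocol.FINE_TUNED, 70.0),
    Score("benchmark_a", "model_large", Protocol.FINE_TUNED, 75.0),
    Score("benchmark_a", "model_large", Protocol.FEW_SHOT, 60.0),
    Score("benchmark_a", "model_other", Protocol.FEW_SHOT, 62.0),
]

winners = best_per_protocol(scores)
for (benchmark, protocol), s in sorted(
    winners.items(), key=lambda kv: (kv[0][0], kv[0][1].value)
):
    print(f"{benchmark} [{protocol.value}]: {s.model} ({s.value})")
```

Keying on the (benchmark, protocol) pair means a fine-tuned score is never ranked against a few-shot one, which is the comparison hazard the referee describes.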

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we will revise the manuscript for greater clarity while maintaining the integrity of our experimental claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results presentation: the headline claim of advancing the state of the art on 25+ benchmarks does not specify, for each benchmark or benchmark group, whether the reported numbers are zero-shot, few-shot, or obtained after task-specific fine-tuning. Because the training mixture already contains captioning, VQA, and document-understanding objectives, it is impossible to isolate the contribution of scale and task breadth from ordinary per-task supervised adaptation that earlier models also receive.

    Authors: We agree that the abstract would benefit from greater specificity on evaluation protocols. In the revised manuscript we will update the abstract to state that results are reported under the standard protocols used by prior work on each benchmark (zero-shot or few-shot for some tasks, task-specific fine-tuning for others), with full per-benchmark details retained in the experimental sections and tables. Regarding isolation of scale versus task adaptation, we note that the paper's central contribution is the joint effect of increased model capacity and broader task mixture; direct comparisons to prior models trained on narrower mixtures are already provided. While a perfectly controlled factorial ablation isolating every variable is indeed difficult, the observed scaling trends and outperformance relative to smaller PaLI variants support the value of the combined approach. We will add a short clarifying paragraph in the introduction.

    revision: partial

  2. Referee: [Emergent capabilities discussion] Section describing emergent capabilities: the observations of complex counting and multilingual object detection are presented as qualitative evidence of emergence, yet the manuscript provides neither quantitative baselines from smaller PaLI variants nor explicit prompting protocols. Without these controls it is difficult to confirm that the behaviors are truly emergent rather than the result of the broader task mixture or prompting choices.

    Authors: We thank the referee for this observation. To strengthen the emergence claims we will incorporate quantitative results from smaller PaLI variants on the same counting and detection prompts, along with the exact prompting templates used. These comparisons are available from our internal scaling experiments and will be added to the revised section, allowing readers to assess whether the behaviors appear only at the largest scale.

    revision: yes
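
The control discussed in the second response, checking whether a behavior appears only at the largest scale, can be sketched as a simple threshold test. This is a hypothetical illustration under stated assumptions: the function name, the `floor` and `margin` parameters, and the model sizes and scores are invented for the example and do not come from the paper or its scaling experiments.

```python
def appears_only_at_largest_scale(scores_by_size, floor, margin):
    """Check a simple 'emergence' pattern on a fixed prompt set.

    scores_by_size maps model size (any comparable unit) to a score.
    Returns True when every smaller model stays within `margin` of the
    `floor` (e.g. a chance-level baseline) and the largest model
    exceeds floor + margin.
    """
    sizes = sorted(scores_by_size)
    if len(sizes) < 2:
        raise ValueError("need at least two model sizes to compare")
    *smaller, largest = sizes
    threshold = floor + margin
    smaller_near_floor = all(scores_by_size[n] <= threshold for n in smaller)
    largest_above = scores_by_size[largest] > threshold
    return smaller_near_floor and largest_above


# Placeholder values only; sizes are in arbitrary units.
abrupt = {1: 0.11, 10: 0.12, 100: 0.40}
gradual = {1: 0.20, 10: 0.30, 100: 0.40}

print(appears_only_at_largest_scale(abrupt, floor=0.10, margin=0.05))
print(appears_only_at_largest_scale(gradual, floor=0.10, margin=0.05))
```

A steady improvement across sizes fails the test, which separates ordinary scaling gains from a jump at the largest model. The same prompts must be used at every size for the comparison to mean anything.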

Circularity Check

0 steps flagged

No circularity: empirical scaling results are self-contained observations

full rationale

The paper reports training a scaled multilingual vision-language model and its benchmark outcomes without any mathematical derivation chain, equations, or predictions that reduce to fitted inputs by construction. Claims rest on direct empirical results from model size increases and task mixture expansion, evaluated on external benchmarks; no self-definitional steps, self-citation load-bearing arguments, or renamed known results appear in the presented material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of scaling model size and task breadth; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5614 in / 1029 out tokens · 11508 ms · 2026-05-17T14:29:04.937460+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  3. RT-H: Action Hierarchies Using Language

    cs.RO 2024-03 conditional novelty 7.0

    RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability...

  4. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  5. MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

    cs.CV 2026-04 unverdicted novelty 6.0

    MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.

  6. OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

  7. GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    cs.RO 2025-05 unverdicted novelty 6.0

    GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.

  8. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  9. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  10. Vision-Language Foundation Models as Effective Robot Imitators

    cs.RO 2023-11 conditional novelty 6.0

    RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.

  11. Aligning Large Multimodal Models with Factually Augmented RLHF

    cs.CV 2023-09 conditional novelty 6.0

    Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.

  12. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  13. GR-3 Technical Report

    cs.RO 2025-07 unverdicted novelty 5.0

    GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

  14. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    cs.RO 2025-07 unverdicted novelty 5.0

    The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.

  15. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  16. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  17. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  18. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · cited by 18 Pith papers · 4 internal anchors

  1. [1]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  2. [2]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Jared Kaplan Melanie Subbiah, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Christopher Hesse Clemens Winter, Mark Chen, Eric Sigler, Mateusz Litwin,...

  3. [3]

    GLaM: Efficient scaling of language models with mixture-of-experts

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...

  4. [4]

    Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H

    Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Z...

  5. [5]

    PaLI: A jointly-scaled multilingual language-image model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebas- tian Alexander Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karago...

  6. [6]

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Je- natton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin ...

  7. [7]

    Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler

    Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. UL2: Unifying language learning paradigms. In ICLR, 2023

  8. [8]

    In: Findings of the 61st Annual Meeting of the Association for Computational Linguistics (2023), https://arxiv.org/abs/ 2212.10505 10

    Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022

  9. [9]

    GIT: A generative image-to-text transformer for vision and language

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. TMLR, 2022

  10. [10]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

  11. [11]

    Language Is Not All You Need: Aligning Perception with Language Models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  12. [12]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  13. [13]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  14. [14]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022

  15. [15]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  16. [16]

    Least-to-most prompting enables complex reasoning in large language models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. In ICLR, 2023

  17. [17]

    Larger language models do in-context learning differently

    Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023

  18. [18]

    Meta-learning via language model in-context tuning

    Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. In ACL, 2022

  19. [19]

    Unified-IO: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. In ICLR, 2023

  20. [20]

    Spotlight: Mobile UI understanding using vision-language models with a focus

    Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. In ICLR, 2023. 25

  21. [21]

    SimVLM: Simple visual language model pretraining with weak supervision

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022

  22. [22]

    Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018

  23. [23]

    Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut

    Ashish V . Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022

  24. [24]

    PreSTU: Pre-training for scene-text understanding

    Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. arXiv preprint arXiv:2209.05534, 2022

  25. [25]

    All you may need for VQA are image captions

    Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. All you may need for VQA are image captions. In NAACL, 2022

  26. [26]

    Pre-training image-language trans- formers for open-vocabulary tasks

    AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Pre-training image-language trans- formers for open-vocabulary tasks. In T4V: Transformers for Vision Workshop, Conference on Computer Vision and Pattern Recognition, 2022

  27. [27]

    Pix2Struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In ICML, 2023

  28. [28]

    Simple open-vocabulary object detection with vision transformers

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection with vision transformers. In ECCV, 2022

  29. [29]

    Vector-quantized image modeling with improved VQGAN

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. In ICLR, 2022

  30. [30]

    Deep visual-semantic alignments for generating image descriptions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015

  31. [31]

    nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019

  32. [32]

    TextCaps: a dataset for image captioning with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: a dataset for image captioning with reading comprehension. In European conference on computer vision, pages 742–758, 2020

  33. [33]

    Captioning images taken by people who are blind

    Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 417–434. Springer, 2020

  34. [34]

    Screen2Words: Automatic mobile ui summarization with multimodal learning

    Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2Words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, UIST ’21, page 498–510, New York, NY , USA, 2021. Association for Computing Machinery

  35. [35]

    Widget Captioning: Generating natural language description for mobile user interface elements

    Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget Captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5495–5510, Online, November 2020. Association for Computational Linguistics

  36. [36]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017

  37. [37]

    OK-VQA: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

  38. [38]

    Tallyqa: Answering complex counting questions

    Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8076–8084, 2019. 26

  39. [39]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  40. [40]

    VizWiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 3608–3617, 2018

  41. [41]

    Jawahar, Ernest Valveny, and Dimosthenis Karatzas

    Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V . Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4290–4300, 2019

  42. [42]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019

  43. [43]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawa- har. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022

  44. [44]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

  45. [45]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016

  46. [46]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, 2022

  47. [47]

    Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities

    Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. arXiv preprint arXiv:2302.11154, 2023

  48. [48]

    Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023

    Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming- Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023

  49. [49]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022

  50. [50]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...

  51. [51]

    Revisiting modulated convolutions for visual counting and beyond

    Duy-Kien Nguyen, Vedanuj Goswami, and Xinlei Chen. Revisiting modulated convolutions for visual counting and beyond. ICLR, 2021

  52. [52]

    Winner team Mia at TextVQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model

    Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at TextVQA challenge 2021: Vision-and-language representation learning with pre-trained sequence-to-sequence model. arXiv preprint arXiv:2106.15332, 2021

  53. [53]

    LaTr: Layout-aware transformer for scene-text VQA

    Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, and R. Manmatha. LaTr: Layout-aware transformer for scene-text VQA. In CVPR, 2022

  54. [54]

    Unifying vision, text, and layout for universal document processing

    Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In CVPR, 2023

  55. [55]

    MSR-VTT: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016

  56. [56]

    VATEX: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019

  57. [57]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017

  58. [58]

    Spoken Moments: Learning joint audio-visual representations from video descriptions

    Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken Moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021

  59. [59]

    NExT-QA: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In CVPR, 2021

  60. [60]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In MM, 2017

  61. [61]

    ActivityNet-QA: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019

  62. [62]

    End-to-end dense video captioning with parallel decoding

    Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6847–6857, 2021

  63. [63]

    VindLU: A recipe for effective video-and-language pretraining

    Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, and Gedas Bertasius. VindLU: A recipe for effective video-and-language pretraining. In CVPR, 2023

  64. [64]

    End-to-end generative pretraining for multimodal video captioning

    Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17959–17968, 2022

  65. [65]

    A multi-world approach to question answering about real-world scenes based on uncertain input

    Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NeurIPS, 2014

  66. [66]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  67. [67]

    Are we done with ImageNet?

    Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020

  68. [68]

    The many faces of robustness: A critical analysis of out-of-distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

  69. [69]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  70. [70]

    Learning robust global representations by penalizing local predictive power

    Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019

  71. [71]

    Do ImageNet classifiers generalize to ImageNet?

    Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400, 2019

  72. [72]

    Pix2Seq: A language modeling framework for object detection

    Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey E. Hinton. Pix2Seq: A language modeling framework for object detection. In The Tenth International Conference on Learning Representations, ICLR, 2022

  73. [73]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollár, and Ross B. Girshick. LVIS: A dataset for large vocabulary instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 5356–5364. Computer Vision Foundation / IEEE, 2019

  74. [74]

    Open-vocabulary object detection via vision and language knowledge distillation

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022

  75. [75]

    RegionCLIP: Region-based language-image pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. RegionCLIP: Region-based language-image pretraining. In CVPR, 2022

  76. [76]

    Women also snowboard: Overcoming bias in captioning models

    Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018

  77. [77]

    Semantics derived automatically from language corpora contain human-like biases

    Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 2017

  78. [78]

    Men also like shopping: Reducing gender bias amplification using corpus-level constraints

    Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017

  79. [79]

    Towards fairness in visual recognition: Effective strategies for bias mitigation

    Zeyu Wang, Klint Qinami, Ioannis Christos Karakozis, Kyle Genova, Prem Nair, Kenji Hata, and Olga Russakovsky. Towards fairness in visual recognition: Effective strategies for bias mitigation. In CVPR, 2020

  80. [80]

    Gender shades: Intersectional accuracy disparities in commercial gender classification

    Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, 2018

Showing first 80 references.