PaLI-X: On Scaling up a Multilingual Vision and Language Model
Pith reviewed 2026-05-17 14:29 UTC · model grok-4.3
The pith
Scaling up PaLI-X sets a new state of the art on most of the vision-and-language benchmarks considered and reveals emergent capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them) by scaling up the size of the components and the breadth of its training task mixture. It achieves new levels of performance on multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot learning, as well as object detection, video question answering, and video captioning. Finally, emerging capabilities are observed, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
What carries the argument
Scaling the PaLI-X multilingual vision and language model in size and training task mixture to improve performance and reveal new abilities.
Load-bearing premise
That larger model size combined with a broader training task mixture will produce higher benchmark scores and emergent behaviors without needing task-specific fine-tuning or architectural modifications.
What would settle it
Training an even larger PaLI-X on the expanded task set and finding either no improvement over the prior state of the art on most benchmarks or no complex counting ability.
Original abstract
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide-range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PaLI-X, a scaled-up multilingual vision-and-language model. The authors increase both model size and the breadth of the training task mixture, reporting new state-of-the-art results on more than 25 vision-and-language benchmarks that include image captioning, visual question answering, document understanding, few-shot in-context learning, object detection, video QA, and video captioning. They additionally document emergent capabilities such as complex counting and multilingual object detection that were not explicitly present in the training mix.
Significance. If the benchmark gains and emergent behaviors are shown to arise under evaluation protocols that are at least as stringent as those used by prior work, the paper would provide concrete evidence that joint scaling of model capacity and task diversity yields both higher performance and qualitatively new abilities in multimodal models. The breadth of tasks covered and the observation of capabilities outside the explicit training distribution are strengths that could inform the design of future generalist vision-language systems.
major comments (2)
- [Abstract] Abstract and results presentation: the headline claim of advancing the state of the art on 25+ benchmarks does not specify, for each benchmark or benchmark group, whether the reported numbers are zero-shot, few-shot, or obtained after task-specific fine-tuning. Because the training mixture already contains captioning, VQA, and document-understanding objectives, it is impossible to isolate the contribution of scale and task breadth from ordinary per-task supervised adaptation that earlier models also receive.
- [Emergent capabilities discussion] Section describing emergent capabilities: the observations of complex counting and multilingual object detection are presented as qualitative evidence of emergence, yet the manuscript provides neither quantitative baselines from smaller PaLI variants nor explicit prompting protocols. Without these controls it is difficult to confirm that the behaviors are truly emergent rather than the result of the broader task mixture or prompting choices.
minor comments (2)
- [Results tables] Table captions and benchmark lists should explicitly note the evaluation protocol (zero-shot / k-shot / fine-tuned) alongside each score so that readers can immediately compare with prior work.
- [Training recipe] The description of the training task mixture would benefit from a concise table enumerating the proportion of each task type and the total number of examples seen during pre-training.
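The protocol labelling requested in the first minor comment can be illustrated with a small sketch. It assumes each reported score is stored with its evaluation protocol, so that a new model is compared with prior work only when both were evaluated under the same protocol. The `Result` record, the `matched_comparisons` helper, and every benchmark name and number below are hypothetical placeholders; none of them come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# Protocol labels taken from the referee's minor comment.
PROTOCOLS = ("zero-shot", "few-shot", "fine-tuned")


@dataclass(frozen=True)
class Result:
    benchmark: str
    model: str
    protocol: str
    score: float  # assumption: higher is better


def matched_comparisons(results, new_model):
    """Map (benchmark, protocol) -> (best new-model score, best prior score).

    Only keys where both the new model and at least one prior model
    report a score under the same protocol are returned.
    """
    grouped = defaultdict(list)
    for r in results:
        if r.protocol not in PROTOCOLS:
            raise ValueError(f"unknown protocol: {r.protocol!r}")
        grouped[(r.benchmark, r.protocol)].append(r)

    matched = {}
    for key, group in grouped.items():
        new = [r.score for r in group if r.model == new_model]
        prior = [r.score for r in group if r.model != new_model]
        if new and prior:
            matched[key] = (max(new), max(prior))
    return matched


# Placeholder rows; names and numbers are illustrative, not from the paper.
rows = [
    Result("bench_a", "new", "fine-tuned", 2.0),
    Result("bench_a", "old", "fine-tuned", 1.0),
    Result("bench_a", "old", "zero-shot", 0.5),
    Result("bench_b", "new", "few-shot", 3.0),
]
pairs = matched_comparisons(rows, "new")
```

In this sketch the zero-shot row for `bench_a` and the few-shot row for `bench_b` are left out of `pairs`, because no score under the same protocol exists on the other side of the comparison.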
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, indicating where we will revise the manuscript for greater clarity while maintaining the integrity of our experimental claims.
-
Referee: [Abstract] Abstract and results presentation: the headline claim of advancing the state of the art on 25+ benchmarks does not specify, for each benchmark or benchmark group, whether the reported numbers are zero-shot, few-shot, or obtained after task-specific fine-tuning. Because the training mixture already contains captioning, VQA, and document-understanding objectives, it is impossible to isolate the contribution of scale and task breadth from ordinary per-task supervised adaptation that earlier models also receive.
Authors: We agree that the abstract would benefit from greater specificity on evaluation protocols. In the revised manuscript we will update the abstract to state that results are reported under the standard protocols used by prior work on each benchmark (zero-shot or few-shot for some tasks, task-specific fine-tuning for others), with full per-benchmark details retained in the experimental sections and tables. Regarding isolation of scale versus task adaptation, we note that the paper's central contribution is the joint effect of increased model capacity and broader task mixture; direct comparisons to prior models trained on narrower mixtures are already provided. While a perfectly controlled factorial ablation isolating every variable is indeed difficult, the observed scaling trends and outperformance relative to smaller PaLI variants support the value of the combined approach. We will add a short clarifying paragraph in the introduction.
Revision: partial
-
Referee: [Emergent capabilities discussion] Section describing emergent capabilities: the observations of complex counting and multilingual object detection are presented as qualitative evidence of emergence, yet the manuscript provides neither quantitative baselines from smaller PaLI variants nor explicit prompting protocols. Without these controls it is difficult to confirm that the behaviors are truly emergent rather than the result of the broader task mixture or prompting choices.
Authors: We thank the referee for this observation. To strengthen the emergence claims we will incorporate quantitative results from smaller PaLI variants on the same counting and detection prompts, along with the exact prompting templates used. These comparisons are available from our internal scaling experiments and will be added to the revised section, allowing readers to assess whether the behaviors appear only at the largest scale.
Revision: yes
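One way to make the "appears only at the largest scale" check concrete is sketched below. The criterion is an assumption made for illustration, not the authors' definition of emergence: every smaller model must score within a margin of a chance-level baseline on a fixed prompt set, while the largest model clearly exceeds it. The function name, the margin, and all numbers are hypothetical.

```python
def appears_only_at_largest(scores_by_params, chance_level, margin=0.05):
    """Return True if only the largest model clearly beats chance.

    scores_by_params maps a parameter count to accuracy on one fixed
    prompt set. The threshold chance_level + margin is an illustrative
    choice, not a standard definition of emergence.
    """
    if len(scores_by_params) < 2:
        raise ValueError("need at least two model sizes to compare")

    sizes = sorted(scores_by_params)
    threshold = chance_level + margin

    # All models except the largest must stay at or below the threshold.
    smaller_near_chance = all(scores_by_params[s] <= threshold for s in sizes[:-1])
    # The largest model must exceed it.
    largest_above = scores_by_params[sizes[-1]] > threshold
    return smaller_near_chance and largest_above


# Placeholder accuracies keyed by an arbitrary size index; not measured results.
jump_at_top = appears_only_at_largest({1: 0.10, 2: 0.12, 3: 0.60}, chance_level=0.10)
gradual = appears_only_at_largest({1: 0.10, 2: 0.40, 3: 0.60}, chance_level=0.10)
```

Here `jump_at_top` is `True` and `gradual` is `False`. A gradual rise across sizes would point to ordinary scaling rather than an ability that is absent in the smaller variants.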
Circularity Check
No circularity: empirical scaling results are self-contained observations
The paper reports training a scaled multilingual vision-language model and its benchmark outcomes without any mathematical derivation chain, equations, or predictions that reduce to fitted inputs by construction. Claims rest on direct empirical results from model size increases and task mixture expansion, evaluated on external benchmarks; no self-definitional steps, self-citation load-bearing arguments, or renamed known results appear in the presented material.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
  PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- RT-H: Action Hierarchies Using Language
  RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability...
- Learning Interactive Real-World Simulators
  UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
- MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
  MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
- OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
  OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
- GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
  GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
  CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- Vision-Language Foundation Models as Effective Robot Imitators
  RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
- Aligning Large Multimodal Models with Factually Augmented RLHF
  Factually Augmented RLHF aligns large multimodal models to reduce hallucinations, reaching 94% of GPT-4 on LLaVA-Bench and 60% improvement on the new MMHAL-BENCH.
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
  Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
- GR-3 Technical Report
  GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
  MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
- Improved Baselines with Visual Instruction Tuning
  Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Reference graph
Works this paper leans on
-
[1]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Tom B. Brown, Benjamin Mann, Nick Ryder, Jared Kaplan Melanie Subbiah, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Christopher Hesse Clemens Winter, Mark Chen, Eric Sigler, Mateusz Litwin,...
work page 2020
-
[3]
GLaM: Efficient scaling of language models with mixture-of-experts
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...
work page 2022
-
[4]
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Z...
work page 2023
-
[5]
PaLI: A jointly-scaled multilingual language-image model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebas- tian Alexander Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karago...
work page 2023
-
[6]
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Je- natton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin ...
work page 2023
-
[7]
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. UL2: Unifying language learning paradigms. In ICLR, 2023
work page 2023
-
[8]
Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to-table translation. arXiv preprint arXiv:2212.10505, 2022
-
[9]
GIT: A generative image-to-text transformer for vision and language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. TMLR, 2022
work page 2022
-
[10]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022
work page 2022
-
[11]
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [12]
-
[13]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
work page 1901
-
[14]
A Survey on In-context Learning
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[15]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[16]
Least-to-most prompting enables complex reasoning in large language models
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. In ICLR, 2023
work page 2023
-
[17]
Larger language models do in-context learning differently
Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023
-
[18]
Meta-learning via language model in-context tuning
Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. In ACL, 2022
work page 2022
-
[19]
Unified-IO: A unified model for vision, language, and multi-modal tasks
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. In ICLR, 2023
work page 2023
-
[20]
Spotlight: Mobile UI understanding using vision-language models with a focus
Gang Li and Yang Li. Spotlight: Mobile UI understanding using vision-language models with a focus. In ICLR, 2023. 25
work page 2023
-
[21]
SimVLM: Simple visual language model pretraining with weak supervision
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. In ICLR, 2022
work page 2022
-
[22]
Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018
work page 2018
-
[23]
Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut
Ashish V . Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively multilingual multimodal evaluation dataset. In EMNLP, 2022
work page 2022
-
[24]
PreSTU: Pre-training for scene-text understanding
Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, and Radu Soricut. PreSTU: Pre-training for scene-text understanding. arXiv preprint arXiv:2209.05534, 2022
-
[25]
All you may need for VQA are image captions
Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, and Radu Soricut. All you may need for VQA are image captions. In NAACL, 2022
work page 2022
-
[26]
Pre-training image-language trans- formers for open-vocabulary tasks
AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Pre-training image-language trans- formers for open-vocabulary tasks. In T4V: Transformers for Vision Workshop, Conference on Computer Vision and Pattern Recognition, 2022
work page 2022
-
[27]
Pix2Struct: Screenshot parsing as pretraining for visual language understanding
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In ICML, 2023
work page 2023
-
[28]
Simple open-vocabulary object detection with vision transformers
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection with vision transformers. In ECCV, 2022
work page 2022
-
[29]
Vector-quantized image modeling with improved VQGAN
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN. In ICLR, 2022
work page 2022
-
[30]
Deep visual-semantic alignments for generating image descriptions
Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015
work page 2015
-
[31]
nocaps: Novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019
work page 2019
-
[32]
TextCaps: a dataset for image captioning with reading comprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: a dataset for image captioning with reading comprehension. In European conference on computer vision, pages 742–758, 2020
work page 2020
-
[33]
Captioning images taken by people who are blind
Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 417–434. Springer, 2020
work page 2020
-
[34]
Screen2Words: Automatic mobile ui summarization with multimodal learning
Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. Screen2Words: Automatic mobile ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User Interface Software and Technology, UIST ’21, page 498–510, New York, NY , USA, 2021. Association for Computing Machinery
work page 2021
-
[35]
Widget Captioning: Generating natural language description for mobile user interface elements
Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget Captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5495–5510, Online, November 2020. Association for Computational Linguistics
work page 2020
-
[36]
Making the V in VQA matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 6904–6913, 2017
work page 2017
-
[37]
OK-VQA: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019
work page 2019
-
[38]
Tallyqa: Answering complex counting questions
Manoj Acharya, Kushal Kafle, and Christopher Kanan. Tallyqa: Answering complex counting questions. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8076–8084, 2019. 26
work page 2019
-
[39]
Towards VQA models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019
work page 2019
-
[40]
VizWiz grand challenge: Answering visual questions from blind people
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 3608–3617, 2018
work page 2018
-
[41]
Jawahar, Ernest Valveny, and Dimosthenis Karatzas
Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Lluis Gomez, Marçal Rusiñol, C.V . Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Scene text visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4290–4300, 2019
work page 2019
-
[42]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019
work page 2019
-
[43]
Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawa- har. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022
work page 2022
-
[44]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021
work page 2021
-
[45]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016
work page 2016
-
[46]
ChartQA: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of ACL, 2022
work page 2022
-
[47]
Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities
Hexiang Hu, Yi Luan, Yang Chen, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, and Ming-Wei Chang. Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities. arXiv preprint arXiv:2302.11154, 2023
-
[48]
Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming- Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? arXiv preprint arXiv:2302.11713, 2023
-
[49]
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022
-
[50]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-E: An embodied ...
work page 2023
-
[51]
Revisiting modulated convolutions for visual counting and beyond
Duy-Kien Nguyen, Vedanuj Goswami, and Xinlei Chen. Revisiting modulated convolutions for visual counting and beyond. ICLR, 2021
work page 2021
-
[52]
Yixuan Qiao, Hao Chen, Jun Wang, Yihao Chen, Xianbin Ye, Ziliang Li, Xianbiao Qi, Peng Gao, and Guotong Xie. Winner team Mia at TextVQA challenge 2021: Vision-and-language represen- tation learning with pre-trained sequence-to-sequence model. arXiv preprint arXiv:2106.15332, 2021
- [53]
-
[54]
Unifying vision, text, and layout for universal document processing
Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. Unifying vision, text, and layout for universal document processing. In CVPR, 2023
work page 2023
-
[55]
MSR-VTT: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016. 27
work page 2016
-
[56]
V ATEX: A large-scale, high-quality multilingual dataset for video-and-language research
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. V ATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019
work page 2019
-
[57]
Dense- captioning events in videos
Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense- captioning events in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017
work page 2017
-
[58]
Spoken Moments: Learning joint audio-visual representations from video descriptions
Mathew Monfort, SouYoung Jin, Alexander Liu, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. Spoken Moments: Learning joint audio-visual representations from video descriptions. In CVPR, 2021
work page 2021
-
[59]
NExT-QA: Next phase of question- answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question- answering to explaining temporal actions. In CVPR, 2021
work page 2021
-
[60]
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In MM, 2017
-
[61]
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, 2019
-
[62]
Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6847–6857, 2021
-
[63]
Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, and Gedas Bertasius. VindLU: A recipe for effective video-and-language pretraining. In CVPR, 2023
-
[64]
Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17959–17968, 2022
-
[65]
Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NeurIPS, 2014
-
[66]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009
-
[67]
Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020
-
[68]
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021
-
[69]
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021
-
[70]
Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pages 10506–10518, 2019
-
[71]
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400, 2019
-
[72]
Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey E. Hinton. Pix2Seq: A language modeling framework for object detection. In The Tenth International Conference on Learning Representations, ICLR, 2022
-
[73]
Agrim Gupta, Piotr Dollár, and Ross B. Girshick. LVIS: A dataset for large vocabulary instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 5356–5364. Computer Vision Foundation / IEEE, 2019
-
[74]
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022
-
[75]
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. RegionCLIP: Region-based language-image pretraining. In CVPR, 2022
-
[76]
Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018
-
[77]
Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 2017
-
[78]
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP, 2017
-
[79]
Zeyu Wang, Klint Qinami, Ioannis Christos Karakozis, Kyle Genova, Prem Nair, Kenji Hata, and Olga Russakovsky. Towards fairness in visual recognition: Effective strategies for bias mitigation. In CVPR, 2020
-
[80]
Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, 2018