pith. machine review for the scientific record.

arxiv: 2309.17421 · v2 · submitted 2023-09-29 · 💻 cs.CV · cs.CL

Recognition: no theorem link

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:22 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords large multimodal models · GPT-4V · visual understanding · multimodal generalist · interleaved inputs · visual referring prompting · multimodal tasks

The pith

GPT-4V processes arbitrarily interleaved multimodal inputs to function as a multimodal generalist system

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes GPT-4V to explore how large multimodal models add visual understanding to language models for stronger generic intelligence. It presents a collection of hand-curated qualitative samples across domains and tasks to test the model's quality, genericity, supported input types, working modes, and effective prompting strategies. Observations from these samples show that the model's ability to handle mixed inputs combined with its broad capabilities positions it as a powerful generalist. The work also identifies new interaction possibilities through the model's recognition of visual markers drawn on images, and it closes with discussions of application scenarios and research directions for such systems.

Core claim

GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make it a powerful multimodal generalist system, with its understanding of visual markers on input images enabling new human-computer interaction methods such as visual referring prompting.
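The "arbitrarily interleaved" phrasing can be made concrete with a short sketch: text and image segments alternate freely within a single prompt, and order is preserved. The part types and the `build_prompt` helper below are illustrative stand-ins, not GPT-4V's actual API.

```python
# Illustrative sketch of an interleaved multimodal prompt: text and image
# parts alternate freely in one sequence, which is the input pattern the
# core claim is about. These helpers are hypothetical, not a real API.

def text(s):
    return {"type": "text", "content": s}

def image(ref):
    return {"type": "image", "content": ref}

def build_prompt(*parts):
    """Validate and assemble an ordered sequence of typed parts."""
    for p in parts:
        assert p["type"] in {"text", "image"}, f"unknown part: {p}"
    return list(parts)

prompt = build_prompt(
    text("Compare the two receipts below and report the price difference."),
    image("receipt_a.png"),
    text("versus"),
    image("receipt_b.png"),
)

# Ordering is preserved, so the model (in principle) sees text and images
# exactly as the user interleaved them.
print([p["type"] for p in prompt])  # → ['text', 'image', 'text', 'image']
```

The point of the sketch is only that nothing constrains the sequence to a fixed "one image, then one question" shape.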

What carries the argument

GPT-4V(ision) as a large multimodal model extending LLMs with visual understanding, demonstrated through performance on curated qualitative samples spanning domains and tasks.

If this is right

  • New human-computer interaction methods arise from the model's ability to interpret visual markers drawn on images.
  • Emerging application scenarios open for GPT-4V-based systems in solving real-world multimodal problems.
  • Future research directions include next-generation multimodal task formulation and methods to enhance LMMs.
  • Deeper understanding of multimodal foundation models can develop from systematic explorations like this one.
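Visual referring prompting, the interaction method the first bullet refers to, replaces a verbal region description with a marker drawn directly on the image. A minimal sketch, using a plain 2D pixel grid and a hypothetical `draw_marker` helper rather than a real image library:

```python
# Minimal sketch of visual referring prompting: the user draws a marker
# (here, a hollow circle) onto the image itself, and the question is
# grounded in the marked region instead of a verbal description.

def draw_marker(pixels, cx, cy, radius, color=255):
    """Overlay a roughly 1-pixel-wide ring centered at (cx, cy)."""
    h, w = len(pixels), len(pixels[0])
    for y in range(h):
        for x in range(w):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            # keep pixels whose squared distance is close to radius**2
            if radius ** 2 - radius <= d2 <= radius ** 2 + radius:
                pixels[y][x] = color
    return pixels

# A blank 16x16 "image"; the circle marks the region the question refers to.
img = [[0] * 16 for _ in range(16)]
draw_marker(img, cx=8, cy=8, radius=4)
question = "What is the object inside the drawn circle?"

marked = sum(v == 255 for row in img for v in row)
print(marked > 0)  # → True
```

The interaction-design claim is that the edited image plus a short question carries the referring information that would otherwise need an awkward textual description of coordinates.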

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar qualitative probes could be applied to compare GPT-4V against other emerging LMMs on the same sample set.
  • Practical deployment might require additional safeguards for handling interleaved inputs in sensitive domains.
  • The prompting techniques identified here could be formalized into reusable templates for broader use.
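The last bullet's suggestion of reusable templates could look like the following sketch, which uses the standard-library `string.Template`; the template text itself is a hypothetical example, not a prompt from the paper.

```python
# Sketch of formalizing an observed prompting technique into a reusable
# template. string.Template is Python stdlib; the wording is illustrative.
from string import Template

referring_template = Template(
    "Look at the region marked with a $marker_color $marker_shape "
    "in the attached image. $question"
)

filled = referring_template.substitute(
    marker_color="red",
    marker_shape="circle",
    question="What brand is the product?",
)
print(filled)
# → Look at the region marked with a red circle in the attached image.
#   What brand is the product?
```

Parameterizing the marker and the question separately is what would make such a template reusable across tasks rather than a one-off example.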

Load-bearing premise

The authors' hand-curated qualitative samples are representative enough to establish the model's genericity and quality without quantitative benchmarks or controlled comparisons.

What would settle it

Quantitative evaluation on standardized multimodal benchmarks where GPT-4V shows no advantage over specialized single-modality models or random baselines would challenge the generalist claim.

read the original abstract

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a qualitative exploration of GPT-4V(ision) through a curated collection of test samples spanning domains and tasks. It examines the model's ability to process arbitrarily interleaved multimodal inputs, its genericity across capabilities, effective prompting strategies including visual referring, and emerging applications, concluding that these observations establish GPT-4V as a powerful multimodal generalist system.

Significance. If the observations hold under more rigorous scrutiny, the work provides an early catalog of GPT-4V behaviors that can guide prompt engineering and interaction design for LMMs. Its discussion of visual markers as a new HCI primitive and the call for future multimodal task formulations are potentially useful for the community, though the absence of quantitative benchmarks limits its role as a definitive benchmark study.

major comments (2)
  1. [Abstract] The statement that curated samples 'demonstrate' GPT-4V's 'unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities' rests entirely on hand-selected qualitative examples, without quantitative metrics, error bars, baseline comparisons to prior LMMs, or statistical sampling of task distributions; the genericity conclusion is therefore vulnerable to selection bias.
  2. [Observations / test samples section] No systematic failure-case analysis or controlled ablation of input interleaving is reported, so the claim of 'arbitrarily interleaved' processing cannot be distinguished from success on the chosen illustrative cases.
minor comments (2)
  1. [Approach / sample curation] The paper should explicitly state the total number of samples, selection criteria, and any post-hoc filtering applied to avoid the appearance of cherry-picking.
  2. [Figures and examples] Figure captions and example presentations would benefit from clearer indication of which visual markers were added by the authors versus native model output.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the paper is a qualitative exploration and will revise the abstract, claims, and add a limitations section to use more cautious language and acknowledge the lack of quantitative evaluation.

read point-by-point responses
  1. Referee: [Abstract] The statement that curated samples 'demonstrate' GPT-4V's 'unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities' rests entirely on hand-selected qualitative examples, without quantitative metrics, error bars, baseline comparisons to prior LMMs, or statistical sampling of task distributions; the genericity conclusion is therefore vulnerable to selection bias.

    Authors: We agree that the current wording overstates the conclusions given the qualitative nature of the work. We will revise the abstract and central claim paragraph to replace 'demonstrate' with 'illustrate through curated examples' and 'suggest', explicitly note the exploratory scope, and add a dedicated limitations section discussing selection bias, absence of quantitative metrics, error bars, baselines, and statistical sampling. revision: yes

  2. Referee: [Observations / test samples section] No systematic failure-case analysis or controlled ablation of input interleaving is reported, so the claim of 'arbitrarily interleaved' processing cannot be distinguished from success on the chosen illustrative cases.

    Authors: We acknowledge the absence of systematic failure analysis or controlled ablations. We will revise the observations section to include additional discussion of observed limitations and challenging interleaving cases from our explorations, and rephrase the 'arbitrarily interleaved' claim to indicate successful handling in the presented samples rather than exhaustive or controlled validation. revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational qualitative analysis

full rationale

The paper contains no equations, fitted parameters, derivations, or self-referential logic. Its central claim rests on hand-curated qualitative samples whose representativeness is a methodological choice rather than a property guaranteed by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes appear. The analysis is self-contained descriptive exploration, free of mathematical or definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no formal derivations, fitted parameters, or postulated entities; it rests entirely on empirical observation of a black-box model.

pith-pipeline@v0.9.0 · 5632 in / 1102 out tokens · 45057 ms · 2026-05-15T23:22:38.619369+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  3. GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

    cs.CL 2026-04 conditional novelty 7.0

    GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.

  4. VoxelCodeBench: Benchmarking 3D World Modeling Through Code Generation

    cs.LG 2026-04 unverdicted novelty 7.0

    VoxelCodeBench shows that leading code models produce executable 3D manipulation code more readily than spatially correct outputs, especially on geometric construction and multi-object composition tasks.

  5. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  6. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    cs.CL 2024-06 accept novelty 7.0

    This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

  7. Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation

    cs.SE 2026-04 unverdicted novelty 6.0

    Raven automates Scratch program assessment by having instructors specify task-level video generation rules and using LLMs to analyze resulting videos for behavioral compliance, outperforming prior tools on real student...

  8. CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

    cs.RO 2026-04 unverdicted novelty 6.0

    CLASP achieves 87% success in open-vocabulary desktop grasping via dual-pathway perception, asynchronous closed-loop evaluation, and automated multimodal data synthesis.

  9. RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on FakeSV and FakeTT datasets.

  10. TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

    cs.CV 2026-03 conditional novelty 6.0

    TagaVLM embeds topological structures into VLMs via residual attention and interleaved prompts, achieving 51.09% success rate on R2R unseen environments and outperforming prior large-model methods.

  11. TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    cs.RO 2024-12 conditional novelty 6.0

    Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

  12. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  13. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  14. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

  15. Think before Go: Hierarchical Reasoning for Image-goal Navigation

    cs.RO 2026-04 unverdicted novelty 5.0

    HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.

  16. Well Begun is Half Done: Training-Free and Model-Agnostic Semantically Guaranteed User Representation Initialization for Multimodal Recommendation

    cs.IR 2026-04 unverdicted novelty 5.0

    SG-URInit builds semantically enriched initial user representations for multimodal recommenders by fusing local item modality features with global cluster semantics, closing the gap with item representations without e...

  17. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  18. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  19. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  20. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

160 extracted references · 160 canonical work pages · cited by 20 Pith papers · 30 internal anchors

  1. [1]

    https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023

    Chatgpt can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023

  2. [2]

    https://github.com/deep-floyd/IF, 2023

    Deepfloyd if. https://github.com/deep-floyd/IF, 2023

  3. [3]

    https://github.com/microsoft/guidance/, 2023

    Guidance. https://github.com/microsoft/guidance/, 2023

  4. [4]

    https://www.midjourney.com/, 2023

    Midjourney. https://www.midjourney.com/, 2023

  5. [5]

    Building rome in a day

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

  6. [6]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  7. [7]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  8. [8]

    Fusion of detected objects in text for visual question answering

    Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. In EMNLP, 2019

  9. [9]

    Bottom-up and top-down attention for image captioning and visual question answering

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018

  10. [10]

    State of gpt

    Karpathy Andrej. State of gpt. https://karpathy.ai/stateofgpt.pdf, 2023

  11. [11]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  12. [12]

    VQA: Visual Question Answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015

  13. [13]

    Openflamingo, March 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023

  14. [14]

    Are elephants bigger than butterflies? reasoning about sizes of objects

    Hessam Bagherinezhad, Hannaneh Hajishirzi, Yejin Choi, and Ali Farhadi. Are elephants bigger than butterflies? reasoning about sizes of objects. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

  15. [15]

    Learning to exploit temporal structure for biomedical vision-language processing

    Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15016–1...

  16. [16]

    Measuring abstract reasoning in neural networks

    David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International conference on machine learning, pages 511–520. PMLR, 2018

  17. [17]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019

  18. [18]

    Training diffusion models with reinforcement learning, 2023

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2023

  19. [19]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

  20. [20]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014

  21. [21]

    Measuring emotional intelligence with the mayer-salovey-caruso emotional intelligence test (msceit)

    Marc A Brackett and Peter Salovey. Measuring emotional intelligence with the mayer-salovey-caruso emotional intelligence test (msceit). Psicothema, 18:34–41, 2006

  22. [22]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023

  23. [23]

    Language models are few-shot learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020

  24. [24]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  25. [25]

    Pix2seq: A language modeling framework for object detection

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022

  26. [26]

    A unified sequence interface for vision tasks

    Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems, 35:31333–31346, 2022

  27. [27]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  28. [28]

    Uniter: Learning universal image-text representations

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. In ECCV, 2020

  29. [29]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021

  30. [30]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  31. [31]

    Referring as a collaborative process

    Herbert H Clark and Deanna Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22(1):1–39, 1986

  32. [32]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

  33. [33]

    Visual perception

    Tom Cornsweet. Visual perception. Academic press, 2012

  34. [34]

    Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559, 2022

  35. [35]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023

  36. [36]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017

  37. [37]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  38. [38]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019

  39. [39]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022

  40. [40]

    Coarse-to-fine vision-language pre-training with fusion in the backbone

    Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, et al. Coarse-to-fine vision-language pre-training with fusion in the backbone. In Advances in Neural Information Processing Systems

  41. [41]

    An empirical study of training end-to-end vision-and-language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022

  42. [42]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  43. [43]

    A modified procedure for naming 332 pictures and collecting norms: Using tangram pictures in psycholinguistic studies

    Alicia Fasquel, Angèle Brunellière, and Dominique Knutsen. A modified procedure for naming 332 pictures and collecting norms: Using tangram pictures in psycholinguistic studies. Behavior Research Methods, pages 1–23, 2022

  44. [44]

    Act the part: Learning interaction strategies for articulated object part discovery

    Samir Yitzhak Gadre, Kiana Ehsani, and Shuran Song. Act the part: Learning interaction strategies for articulated object part discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15752–15761, 2021

  45. [45]

    Large-scale adversarial training for vision-and-language representation learning

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. In NeurIPS, 2020

  46. [46]

    Vision- language pre-training: Basics, recent advances, and future trends

    Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision- language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022

  47. [47]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023

  48. [48]

    Multimodal-gpt: A vision and language model for dialogue with humans, 2023

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023

  49. [49]

    Ms-celeb-1m: A dataset and benchmark for large-scale face recognition

    Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 87–102. Springer, 2016

  50. [50]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020

  51. [51]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

  52. [52]

    The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

    Jack Hessel, Jena D Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, and Yejin Choi. The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. In ECCV, 2022

  53. [53]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  54. [54]

    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task-aware image captioning. In Proceedings of International Conference on Computer Vision (ICCV), 2023

  55. [55]

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  56. [56]

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022

  57. [57]

    Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020

  58. [58]

    Fabian Hutmacher. Why is there so much more research on vision than on any other sensory modality? Frontiers in psychology, 10:2246, 2019

  59. [59]

Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert Hawkins, and Yoav Artzi. Abstract visual reasoning with tangram shapes. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 582–601, 2022

  60. [60]

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019

  61. [61]

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  62. [62]

    Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4565–4574, 2016

  63. [63]

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

  64. [64]

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021

  65. [65]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of International Conference on Computer Vision (ICCV), 2023

  66. [66]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

  67. [67]

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017

  68. [68]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  69. [69]

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023

  70. [70]

    Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020

  71. [71]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  72. [72]

Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021

  73. [73]

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019

  74. [74]

    Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020

  75. [75]

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023

  76. [76]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

  77. [77]

Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics

  78. [78]

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023

  79. [79]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  80. [80]

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015
