pith. machine review for the scientific record.

arxiv: 2309.17421 · v2 · submitted 2023-09-29 · 💻 cs.CV · cs.CL

Recognition: no theorem link

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:22 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords large multimodal models · GPT-4V · visual understanding · multimodal generalist · interleaved inputs · visual referring prompting · multimodal tasks

The pith

GPT-4V processes arbitrarily interleaved multimodal inputs to function as a multimodal generalist system

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes GPT-4V to explore how large multimodal models add visual understanding to language models for stronger generic intelligence. It presents a collection of hand-curated qualitative samples across domains and tasks to test the model's quality, genericity, supported input types, working modes, and effective prompting strategies. Observations from these samples show that the model's ability to handle mixed inputs combined with its broad capabilities positions it as a powerful generalist. The work also identifies new interaction possibilities through the model's recognition of visual markers drawn on images, and it closes with discussions of application scenarios and research directions for such systems.

Core claim

GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make it a powerful multimodal generalist system, with its understanding of visual markers on input images enabling new human-computer interaction methods such as visual referring prompting.
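The "arbitrarily interleaved" phrasing can be made concrete with a short sketch: text and image segments alternate freely within a single prompt, and order is preserved. The part types and the `build_prompt` helper below are illustrative stand-ins, not GPT-4V's actual API.

```python
# Illustrative sketch of an interleaved multimodal prompt: text and image
# parts alternate freely in one sequence, which is the input pattern the
# core claim is about. These helpers are hypothetical, not a real API.

def text(s):
    return {"type": "text", "content": s}

def image(ref):
    return {"type": "image", "content": ref}

def build_prompt(*parts):
    """Validate and assemble an ordered sequence of typed parts."""
    for p in parts:
        assert p["type"] in {"text", "image"}, f"unknown part: {p}"
    return list(parts)

prompt = build_prompt(
    text("Compare the two receipts below and report the price difference."),
    image("receipt_a.png"),
    text("versus"),
    image("receipt_b.png"),
)

# Ordering is preserved, so the model (in principle) sees text and images
# exactly as the user interleaved them.
print([p["type"] for p in prompt])  # → ['text', 'image', 'text', 'image']
```

The point of the sketch is only that nothing constrains the sequence to a fixed "one image, then one question" shape.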

What carries the argument

GPT-4V(ision) as a large multimodal model extending LLMs with visual understanding, demonstrated through performance on curated qualitative samples spanning domains and tasks.

If this is right

  • New human-computer interaction methods arise from the model's ability to interpret visual markers drawn on images.
  • Emerging application scenarios open for GPT-4V-based systems in solving real-world multimodal problems.
  • Future research directions include next-generation multimodal task formulation and methods to enhance LMMs.
  • Deeper understanding of multimodal foundation models can develop from systematic explorations like this one.
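Visual referring prompting, the interaction method the first bullet refers to, replaces a verbal region description with a marker drawn directly on the image. A minimal sketch, using a plain 2D pixel grid and a hypothetical `draw_marker` helper rather than a real image library:

```python
# Minimal sketch of visual referring prompting: the user draws a marker
# (here, a hollow circle) onto the image itself, and the question is
# grounded in the marked region instead of a verbal description.

def draw_marker(pixels, cx, cy, radius, color=255):
    """Overlay a roughly 1-pixel-wide ring centered at (cx, cy)."""
    h, w = len(pixels), len(pixels[0])
    for y in range(h):
        for x in range(w):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            # keep pixels whose squared distance is close to radius**2
            if radius ** 2 - radius <= d2 <= radius ** 2 + radius:
                pixels[y][x] = color
    return pixels

# A blank 16x16 "image"; the circle marks the region the question refers to.
img = [[0] * 16 for _ in range(16)]
draw_marker(img, cx=8, cy=8, radius=4)
question = "What is the object inside the drawn circle?"

marked = sum(v == 255 for row in img for v in row)
print(marked > 0)  # → True
```

The interaction-design claim is that the edited image plus a short question carries the referring information that would otherwise need an awkward textual description of coordinates.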

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar qualitative probes could be applied to compare GPT-4V against other emerging LMMs on the same sample set.
  • Practical deployment might require additional safeguards for handling interleaved inputs in sensitive domains.
  • The prompting techniques identified here could be formalized into reusable templates for broader use.
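The last bullet's suggestion of reusable templates could look like the following sketch, which uses the standard-library `string.Template`; the template text itself is a hypothetical example, not a prompt from the paper.

```python
# Sketch of formalizing an observed prompting technique into a reusable
# template. string.Template is Python stdlib; the wording is illustrative.
from string import Template

referring_template = Template(
    "Look at the region marked with a $marker_color $marker_shape "
    "in the attached image. $question"
)

filled = referring_template.substitute(
    marker_color="red",
    marker_shape="circle",
    question="What brand is the product?",
)
print(filled)
# → Look at the region marked with a red circle in the attached image.
#   What brand is the product?
```

Parameterizing the marker and the question separately is what would make such a template reusable across tasks rather than a one-off example.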

Load-bearing premise

The authors' hand-curated qualitative samples are representative enough to establish the model's genericity and quality without quantitative benchmarks or controlled comparisons.

What would settle it

Quantitative evaluation on standardized multimodal benchmarks where GPT-4V shows no advantage over specialized single-modality models or random baselines would challenge the generalist claim.

read the original abstract

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a qualitative exploration of GPT-4V(ision) through a curated collection of test samples spanning domains and tasks. It examines the model's ability to process arbitrarily interleaved multimodal inputs, its genericity across capabilities, effective prompting strategies including visual referring, and emerging applications, concluding that these observations establish GPT-4V as a powerful multimodal generalist system.

Significance. If the observations hold under more rigorous scrutiny, the work provides an early catalog of GPT-4V behaviors that can guide prompt engineering and interaction design for LMMs. Its discussion of visual markers as a new HCI primitive and the call for future multimodal task formulations are potentially useful for the community, though the absence of quantitative benchmarks limits its role as a definitive benchmark study.

major comments (2)
  1. [Abstract] The statement that curated samples 'demonstrate' GPT-4V's 'unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities' rests entirely on hand-selected qualitative examples, without quantitative metrics, error bars, baseline comparisons to prior LMMs, or statistical sampling of task distributions; the genericity conclusion is therefore vulnerable to selection bias.
  2. [Observations / test samples section] No systematic failure-case analysis or controlled ablation of input interleaving is reported, so the claim of 'arbitrarily interleaved' processing cannot be distinguished from success on the chosen illustrative cases.
minor comments (2)
  1. [Approach / sample curation] The paper should explicitly state the total number of samples, selection criteria, and any post-hoc filtering applied to avoid the appearance of cherry-picking.
  2. [Figures and examples] Figure captions and example presentations would benefit from clearer indication of which visual markers were added by the authors versus native model output.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the paper is a qualitative exploration and will revise the abstract, claims, and add a limitations section to use more cautious language and acknowledge the lack of quantitative evaluation.

read point-by-point responses
  1. Referee: [Abstract] The statement that curated samples 'demonstrate' GPT-4V's 'unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities' rests entirely on hand-selected qualitative examples, without quantitative metrics, error bars, baseline comparisons to prior LMMs, or statistical sampling of task distributions; the genericity conclusion is therefore vulnerable to selection bias.

    Authors: We agree that the current wording overstates the conclusions given the qualitative nature of the work. We will revise the abstract and central claim paragraph to replace 'demonstrate' with 'illustrate through curated examples' and 'suggest', explicitly note the exploratory scope, and add a dedicated limitations section discussing selection bias, absence of quantitative metrics, error bars, baselines, and statistical sampling. revision: yes

  2. Referee: [Observations / test samples section] No systematic failure-case analysis or controlled ablation of input interleaving is reported, so the claim of 'arbitrarily interleaved' processing cannot be distinguished from success on the chosen illustrative cases.

    Authors: We acknowledge the absence of systematic failure analysis or controlled ablations. We will revise the observations section to include additional discussion of observed limitations and challenging interleaving cases from our explorations, and rephrase the 'arbitrarily interleaved' claim to indicate successful handling in the presented samples rather than exhaustive or controlled validation. revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational qualitative analysis

full rationale

The paper contains no equations, fitted parameters, derivations, or self-referential logic. Its central claim rests on hand-curated qualitative samples whose representativeness is a methodological choice rather than a property guaranteed by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes appear. The analysis is self-contained descriptive exploration, free of mathematical or definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no formal derivations, fitted parameters, or postulated entities; it rests entirely on empirical observation of a black-box model.

pith-pipeline@v0.9.0 · 5632 in / 1102 out tokens · 45057 ms · 2026-05-15T23:22:38.619369+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  3. GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

    cs.CL 2026-04 conditional novelty 7.0

    GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.

  4. VoxelCodeBench: Benchmarking 3D World Modeling Through Code Generation

    cs.LG 2026-04 unverdicted novelty 7.0

    VoxelCodeBench shows that leading code models produce executable 3D manipulation code more readily than spatially correct outputs, especially on geometric construction and multi-object composition tasks.

  5. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  6. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    cs.CL 2024-06 accept novelty 7.0

    This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

  7. Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation

    cs.SE 2026-04 unverdicted novelty 6.0

    Raven automates Scratch program assessment by having instructors specify task-level video generation rules and using LLMs to analyze resulting videos for behavioral compliance, outperforming prior tools on real student...

  8. CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping

    cs.RO 2026-04 unverdicted novelty 6.0

    CLASP achieves 87% success in open-vocabulary desktop grasping via dual-pathway perception, asynchronous closed-loop evaluation, and automated multimodal data synthesis.

  9. RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on FakeSV and FakeTT datasets.

  10. TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

    cs.CV 2026-03 conditional novelty 6.0

    TagaVLM embeds topological structures into VLMs via residual attention and interleaved prompts, achieving 51.09% success rate on R2R unseen environments and outperforming prior large-model methods.

  11. TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    cs.RO 2024-12 conditional novelty 6.0

    Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

  12. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  13. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  14. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

    cs.CV 2026-04 unverdicted novelty 5.0

    Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.

  15. Think before Go: Hierarchical Reasoning for Image-goal Navigation

    cs.RO 2026-04 unverdicted novelty 5.0

    HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.

  16. Well Begun is Half Done: Training-Free and Model-Agnostic Semantically Guaranteed User Representation Initialization for Multimodal Recommendation

    cs.IR 2026-04 unverdicted novelty 5.0

    SG-URInit builds semantically enriched initial user representations for multimodal recommenders by fusing local item modality features with global cluster semantics, closing the gap with item representations without e...

  17. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  18. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  19. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  20. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

160 extracted references · 160 canonical work pages · cited by 20 Pith papers · 30 internal anchors

  1. [1]

    https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023

    Chatgpt can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023

  2. [2]

    https://github.com/deep-floyd/IF, 2023

    Deepfloyd if. https://github.com/deep-floyd/IF, 2023

  3. [3]

    https://github.com/microsoft/guidance/, 2023

    Guidance. https://github.com/microsoft/guidance/, 2023

  4. [4]

    https://www.midjourney.com/, 2023

    Midjourney. https://www.midjourney.com/, 2023

  5. [5]

    Building rome in a day

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

  6. [6]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  7. [7]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  8. [8]

    Fusion of detected objects in text for visual question answering

    Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. Fusion of detected objects in text for visual question answering. In EMNLP, 2019

  9. [9]

    Bottom-up and top-down attention for image captioning and visual question answering

    Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018

  10. [10]

    State of gpt

    Karpathy Andrej. State of gpt. https://karpathy.ai/stateofgpt.pdf, 2023

  11. [11]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  12. [12]

    VQA: Visual Question Answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015

  13. [13]

    Openflamingo, March 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023

  14. [14]

    Are elephants bigger than butterflies? reasoning about sizes of objects

    Hessam Bagherinezhad, Hannaneh Hajishirzi, Yejin Choi, and Ali Farhadi. Are elephants bigger than butterflies? reasoning about sizes of objects. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

  15. [15]

    Learning to exploit temporal structure for biomedical vision-language processing

    Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15016–1...

  16. [16]

    Measuring abstract reasoning in neural networks

    David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In International conference on machine learning, pages 511–520. PMLR, 2018

  17. [17]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In ICCV, 2019

  18. [18]

    Training diffusion models with reinforcement learning, 2023

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2023

  19. [19]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

  20. [20]

    Food-101–mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014

  21. [21]

    Measuring emotional intelligence with the mayer-salovey-caruso emotional intelligence test (msceit)

    Marc A Brackett and Peter Salovey. Measuring emotional intelligence with the mayer-salovey-caruso emotional intelligence test (msceit). Psicothema, 18:34–41, 2006

  22. [22]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023

  23. [23]

    Language models are few-shot learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020

  24. [24]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  25. [25]

    Pix2seq: A language modeling framework for object detection

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In ICLR, 2022

  26. [26]

    A unified sequence interface for vision tasks

    Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J Fleet, and Geoffrey E Hinton. A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems, 35:31333–31346, 2022

  27. [27]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  28. [28]

    Uniter: Learning universal image-text representations

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. In ECCV, 2020

  29. [29]

    Unifying vision-and-language tasks via text generation

    Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021

  30. [30]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  31. [31]

    Referring as a collaborative process

    Herbert H Clark and Deanna Wilkes-Gibbs. Referring as a collaborative process. Cognition, 22(1):1–39, 1986

  32. [32]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

  33. [33]

    Visual perception

    Tom Cornsweet. Visual perception. Academic press, 2012

  34. [34]

    Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559, 2022

  35. [35]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023

  36. [36]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017

  37. [37]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009

  38. [38]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019

  39. [39]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022

  40. [40]

    Coarse-to-fine vision-language pre-training with fusion in the backbone

    Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, et al. Coarse-to-fine vision-language pre-training with fusion in the backbone. In Advances in Neural Information Processing Systems

  41. [41]

    An empirical study of training end-to-end vision-and-language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176, 2022

  42. [42]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  43. [43]

    A modified procedure for naming 332 pictures and collecting norms: Using tangram pictures in psycholinguistic studies

    Alicia Fasquel, Angèle Brunellière, and Dominique Knutsen. A modified procedure for naming 332 pictures and collecting norms: Using tangram pictures in psycholinguistic studies. Behavior Research Methods, pages 1–23, 2022

  44. [44]

    Act the part: Learning interaction strategies for articulated object part discovery

    Samir Yitzhak Gadre, Kiana Ehsani, and Shuran Song. Act the part: Learning interaction strategies for articulated object part discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15752–15761, 2021

  45. [45]

    Large-scale adversarial training for vision-and-language representation learning

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. In NeurIPS, 2020

  46. [46]

    Vision- language pre-training: Basics, recent advances, and future trends

    Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision- language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 14(3–4):163–352, 2022

  47. [47]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR, 2023

  48. [48]

    Multimodal-gpt: A vision and language model for dialogue with humans, 2023

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023

  49. [49]

    Ms-celeb-1m: A dataset and benchmark for large-scale face recognition

    Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 87–102. Springer, 2016

  50. [50]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020

  51. [51]

    Mask r-cnn

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017

  52. [52]

    The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

    Jack Hessel, Jena D Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, and Yejin Choi. The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning. In ECCV, 2022

  53. [53]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  54. [54]

    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task-aware image captioning. In Proceedings of International Conference on Computer Vision (ICCV), 2023

  55. [55]

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  56. [56]

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022

  57. [57]

    Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849, 2020

  58. [58]

    Fabian Hutmacher. Why is there so much more research on vision than on any other sensory modality? Frontiers in psychology, 10:2246, 2019

  59. [59]

Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert Hawkins, and Yoav Artzi. Abstract visual reasoning with tangram shapes. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 582–601, 2022

  60. [60]

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019

  61. [61]

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  62. [62]

    Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4565–4574, 2016

  63. [63]

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

  64. [64]

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. In ICML, 2021

  65. [65]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of International Conference on Computer Vision (ICCV), 2023

  66. [66]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

  67. [67]

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017

  68. [68]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  69. [69]

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023

  70. [70]

    Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020

  71. [71]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  72. [72]

Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021

  73. [73]

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019

  74. [74]

    Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020

  75. [75]

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023

  76. [76]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

  77. [77]

Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics

  78. [78]

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023

  79. [79]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  80. [80]

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015
