The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
Pith reviewed 2026-05-15 23:22 UTC · model grok-4.3
The pith
GPT-4V processes arbitrarily interleaved multimodal inputs to function as a multimodal generalist system
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-4V's unprecedented ability to process arbitrarily interleaved multimodal inputs, together with the genericity of its capabilities, makes it a powerful multimodal generalist system; its understanding of visual markers drawn on input images enables new human-computer interaction methods such as visual referring prompting.
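To make the central term concrete: "arbitrarily interleaved multimodal inputs" are prompts in which text segments and images alternate freely within a single turn. Below is a minimal sketch, assuming an OpenAI-style chat-completions payload with text and image_url content parts; the file names and query are illustrative and not drawn from the paper.

```python
# Sketch of an interleaved multimodal input: text and images alternate freely
# within one user turn. Paths and wording are placeholders, not from the paper.
import base64
from pathlib import Path

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data-URL content part."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

interleaved_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Here is the reference product photo:"},
        image_part("reference.png"),
        {"type": "text", "text": "And here are two shelf images from the store:"},
        image_part("shelf_a.png"),
        image_part("shelf_b.png"),
        {"type": "text", "text": "Which shelf contains the reference product, and where?"},
    ],
}
```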
What carries the argument
GPT-4V(ision), a large multimodal model that extends LLMs with visual understanding, demonstrated through curated qualitative samples spanning a variety of domains and tasks.
If this is right
- New human-computer interaction methods arise from the model's ability to interpret visual markers drawn on images.
- Emerging application scenarios open for GPT-4V-based systems in solving real-world multimodal problems.
- Future research directions include next-generation multimodal task formulation and methods to enhance LMMs.
- Deeper understanding of multimodal foundation models can develop from systematic explorations like this one.
Where Pith is reading between the lines
- Similar qualitative probes could be applied to compare GPT-4V against other emerging LMMs on the same sample set.
- Practical deployment might require additional safeguards for handling interleaved inputs in sensitive domains.
- The prompting techniques identified here could be formalized into reusable templates for broader use (a sketch follows this list).
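One way the last point could be made concrete, offered purely as a hedged sketch rather than a method from the paper: a visual referring prompt can be packaged as a small helper that draws the referring marker onto the image (here with Pillow) and returns an interleaved text+image payload like the one sketched earlier. The function name, box coordinates, and wording are hypothetical.

```python
# Hypothetical reusable template for visual referring prompting: mark a region
# on the image, then send the annotated image interleaved with text.
import base64
from io import BytesIO
from PIL import Image, ImageDraw

def visual_referring_prompt(image_path: str, box: tuple, question: str) -> list:
    """Draw a red box around the region of interest and return an
    interleaved text+image content list for a chat-style multimodal API."""
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, outline="red", width=4)

    buf = BytesIO()
    img.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")

    return [
        {"type": "text", "text": "Answer the question about the region inside the red box."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": question},
    ]

# Example use: ask about whatever sits inside the pixel box (120, 80, 360, 300).
content = visual_referring_prompt(
    "scene.png", (120, 80, 360, 300), "What is this object, and what is it used for?"
)
```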
Load-bearing premise
The authors' hand-curated qualitative samples are representative enough to establish the model's genericity and quality without quantitative benchmarks or controlled comparisons.
What would settle it
Quantitative evaluation on standardized multimodal benchmarks where GPT-4V shows no advantage over specialized single-modality models or random baselines would challenge the generalist claim.
original abstract
Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a qualitative exploration of GPT-4V(ision) through a curated collection of test samples spanning domains and tasks. It examines the model's ability to process arbitrarily interleaved multimodal inputs, its genericity across capabilities, effective prompting strategies including visual referring, and emerging applications, concluding that these observations establish GPT-4V as a powerful multimodal generalist system.
Significance. If the observations hold under more rigorous scrutiny, the work provides an early catalog of GPT-4V behaviors that can guide prompt engineering and interaction design for LMMs. Its discussion of visual markers as a new HCI primitive and the call for future multimodal task formulations are potentially useful for the community, though the absence of quantitative benchmarks limits its value as a definitive evaluation.
major comments (2)
- [Abstract] Abstract and the central claim paragraph: the statement that curated samples 'demonstrate' GPT-4V's 'unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities' rests entirely on hand-selected qualitative examples without quantitative metrics, error bars, baseline comparisons to prior LMMs, or statistical sampling of task distributions, leaving the genericity conclusion vulnerable to selection bias.
- [Observations / test samples section] Section describing the test samples and observations: no systematic failure-case analysis or controlled ablation of input interleaving is reported, so the claim of 'arbitrarily interleaved' processing cannot be distinguished from success on the chosen illustrative cases.
minor comments (2)
- [Approach / sample curation] The paper should explicitly state the total number of samples, selection criteria, and any post-hoc filtering applied to avoid the appearance of cherry-picking.
- [Figures and examples] Figure captions and example presentations would benefit from clearer indication of which visual markers were added by the authors versus native model output.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the paper is a qualitative exploration; we will revise the abstract and claims to use more cautious language and add a limitations section acknowledging the lack of quantitative evaluation.
point-by-point responses
-
Referee: [Abstract] Abstract and the central claim paragraph: the statement that curated samples 'demonstrate' GPT-4V's 'unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities' rests entirely on hand-selected qualitative examples without quantitative metrics, error bars, baseline comparisons to prior LMMs, or statistical sampling of task distributions, leaving the genericity conclusion vulnerable to selection bias.
Authors: We agree that the current wording overstates the conclusions given the qualitative nature of the work. We will revise the abstract and central claim paragraph to replace 'demonstrate' with 'illustrate through curated examples' and 'suggest', explicitly note the exploratory scope, and add a dedicated limitations section discussing selection bias, absence of quantitative metrics, error bars, baselines, and statistical sampling. revision: yes
-
Referee: [Observations / test samples section] Section describing the test samples and observations: no systematic failure-case analysis or controlled ablation of input interleaving is reported, so the claim of 'arbitrarily interleaved' processing cannot be distinguished from success on the chosen illustrative cases.
Authors: We acknowledge the absence of systematic failure analysis or controlled ablations. We will revise the observations section to include additional discussion of observed limitations and challenging interleaving cases from our explorations, and rephrase the 'arbitrarily interleaved' claim to indicate successful handling in the presented samples rather than exhaustive or controlled validation. revision: partial
Circularity Check
No circularity: purely observational qualitative analysis
full rationale
The paper contains no equations, fitted parameters, derivations, or self-referential logic. Its central claim rests on hand-curated qualitative samples whose representativeness is a methodological choice, not a reduction to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The analysis is self-contained as descriptive exploration without any mathematical or definitional circularity.
Axiom & Free-Parameter Ledger: empty; the paper introduces no equations, fitted parameters, or ansatzes to record.
Forward citations
Cited by 20 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.
-
VoxelCodeBench: Benchmarking 3D World Modeling Through Code Generation
VoxelCodeBench shows that leading code models produce executable 3D manipulation code more readily than spatially correct outputs, especially on geometric construction and multi-object composition tasks.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.
-
Raven: Rethinking Automated Assessment for Scratch Programs via Video-Grounded Evaluation
Raven automates Scratch program assessment by having instructors specify task-level video generation rules and using LLMs to analyze resulting videos for behavioral compliance, outperforming prior tools on real studen...
-
CLASP: Closed-loop Asynchronous Spatial Perception for Open-vocabulary Desktop Object Grasping
CLASP achieves 87% success in open-vocabulary desktop grasping via dual-pathway perception, asynchronous closed-loop evaluation, and automated multimodal data synthesis.
-
RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection
RASR retrieves cross-instance semantic evidence and uses domain priors to drive multimodal LLM reasoning for improved fake news video detection on FakeSV and FakeTT datasets.
-
TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
TagaVLM embeds topological structures into VLMs via residual attention and interleaved prompts, achieving 51.09% success rate on R2R unseen environments and outperforming prior large-model methods.
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
-
Think before Go: Hierarchical Reasoning for Image-goal Navigation
HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
-
Well Begun is Half Done: Training-Free and Model-Agnostic Semantically Guaranteed User Representation Initialization for Multimodal Recommendation
SG-URInit builds semantically enriched initial user representations for multimodal recommenders by fusing local item modality features with global cluster semantics, closing the gap with item representations without e...
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.