pith. machine review for the scientific record.

arXiv: 2310.11441 · v2 · submitted 2023-10-17 · cs.CV · cs.AI · cs.CL · cs.HC

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Chunyuan Li, Feng Li, Hao Zhang, Jianfeng Gao, Jianwei Yang, Xueyan Zou

Pith reviewed 2026-05-12 13:54 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.CL · cs.HC
keywords: Set-of-Mark prompting, visual grounding, GPT-4V, referring expression comprehension, zero-shot learning, multimodal models, image segmentation, SAM

The pith

Set-of-Mark prompting lets zero-shot GPT-4V ground visual references more accurately than the state-of-the-art fully fine-tuned model on RefCOCOg.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Set-of-Mark prompting to enhance the visual grounding capabilities of large multimodal models like GPT-4V. The technique involves using off-the-shelf segmentation models to divide an image into regions at various levels of detail and then labeling those regions with marks such as numbers, boxes, or masks. These marked images are then provided as input to GPT-4V, allowing it to answer questions that require identifying or describing specific parts of the image. Experiments demonstrate that this zero-shot approach outperforms models that have been fully fine-tuned on the task for referring expression comprehension and segmentation on the RefCOCOg benchmark. This matters because it suggests that general-purpose models can handle complex visual tasks through clever input formatting rather than requiring specialized training.

Core claim

By partitioning an image into regions using interactive segmentation models like SEEM or SAM and overlaying a set of marks such as alphanumerics, masks, and boxes, GPT-4V can be guided to perform fine-grained visual grounding. This enables it to outperform the state-of-the-art fully-finetuned referring expression comprehension and segmentation models on RefCOCOg in a zero-shot setting.

What carries the argument

Set-of-Mark (SoM) prompting, where off-the-shelf segmentation partitions the image and marks are overlaid to make regions identifiable for the multimodal model.
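
As a rough illustration of the marking step, the sketch below assigns a numeric mark to each region and picks a pixel at which to draw it. The source text does not say how SoM positions its marks, so the rule used here (the in-mask pixel nearest the region centroid) is an assumption, and the names `mark_position` and `assign_marks` are hypothetical. Masks are plain nested lists so the example runs without any segmentation library.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask, 1 = pixel belongs to the region


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) inside the mask that is closest to its centroid.

    Snapping to an in-mask pixel keeps the label on the region even when the
    centroid itself falls outside it (e.g. for a ring-shaped region).
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("empty mask")
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - cr) ** 2 + (p[1] - cc) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Give each region a numeric mark (1, 2, ...) and a place to draw it."""
    return {i + 1: mark_position(m) for i, m in enumerate(masks)}


# Two regions in a 3x4 image: a left block and a right column.
left = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
right = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]
marks = assign_marks([left, right])
```

In a full pipeline the masks would come from a segmenter such as SEEM or SAM, and the returned positions would be handed to a drawing routine that renders the numbers onto the image before it is sent to the multimodal model.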

If this is right

  • GPT-4V with SoM outperforms SOTA fine-tuned models on RefCOCOg for referring expression comprehension and segmentation.
  • The method applies to a wide range of fine-grained vision and multimodal tasks without any fine-tuning of the LMM.
  • It relies on the quality of region marks generated by external segmentation models like SEEM and SAM.
  • The released code enables direct application to new images and queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This prompting style could extend to other large multimodal models to unlock similar grounding improvements.
  • Advances in segmentation accuracy would directly raise the ceiling for SoM-based performance.
  • The approach may lower the data and compute costs for deploying capable visual AI systems.
  • Applying similar marking to video frames or 3D scenes could test broader applicability.

Load-bearing premise

The marks overlaid on image regions must be sufficiently accurate and unambiguous for GPT-4V to correctly associate them with the described objects without confusion or hallucination.

What would settle it

A controlled test on RefCOCOg using verified accurate marks where GPT-4V with SoM fails to exceed the accuracy of the fine-tuned SOTA model or consistently selects the wrong mark for a referring expression.

Original abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Set-of-Mark (SoM) prompting, a visual prompting technique that uses off-the-shelf segmentation models (e.g., SEEM or SAM) to partition an input image into regions at multiple granularities and overlays these regions with marks such as alphanumerics, masks, or boxes. The marked image is then provided to GPT-4V (or similar LMMs) to enable zero-shot visual grounding on fine-grained tasks including referring expression comprehension, segmentation, and other vision-language benchmarks. The central empirical claim is that GPT-4V with SoM outperforms the prior state-of-the-art fully fine-tuned models on RefCOCOg while requiring no task-specific training; the authors also report results across a range of additional tasks and release the code publicly.

Significance. If the reported gains hold under rigorous controls, the work is significant because it shows that a simple, training-free visual prompting strategy can unlock strong grounding performance in frontier LMMs, reducing reliance on expensive fine-tuning. The public code release is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. [§4, RefCOCOg experiments] The coverage rate, i.e., the fraction of test examples for which at least one generated mark has IoU > 0.5 with the ground-truth referred object, is not reported. Because the segmenter receives no text query, any systematic failure to mark the target region directly caps SoM accuracy; without this statistic it is impossible to separate the contribution of SoM from the quality of the upstream segmentation.
  2. [§3 and §4.2] No ablation replaces SEEM/SAM with a deliberately weaker or query-agnostic segmenter (or with random marks) on the same RefCOCOg split. Such a control would quantify how much of the reported outperformance is attributable to SoM versus the assumption that off-the-shelf multi-granularity segmentation already supplies near-perfect candidate regions.
minor comments (2)
  1. [Figure 1] The figure and its caption would benefit from explicit enumeration of the mark types (alphanumeric, mask, box) and their visual encoding so readers can immediately map the illustration to the method description.
  2. [§3] The paper should state the exact version and checkpoint of SEEM/SAM used for all reported numbers, together with any post-processing (e.g., non-maximum suppression) applied to the generated marks.
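
The coverage statistic requested in the first major comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: masks are represented as sets of pixel coordinates, and the names `mask_iou` and `coverage_rate` are hypothetical. The 0.5 threshold is the one the referee names.

```python
from typing import List, Set, Tuple

Pixels = Set[Tuple[int, int]]  # a mask as a set of (row, col) coordinates


def mask_iou(a: Pixels, b: Pixels) -> float:
    """Intersection over union of two pixel sets; 0.0 if both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def coverage_rate(
    examples: List[Tuple[List[Pixels], Pixels]], threshold: float = 0.5
) -> float:
    """Fraction of examples where some candidate mark overlaps the target.

    Each example is (candidate_masks, ground_truth_mask). An example counts
    as covered when at least one candidate has IoU strictly above threshold.
    """
    if not examples:
        raise ValueError("no examples")
    covered = sum(
        any(mask_iou(m, gt) > threshold for m in marks) for marks, gt in examples
    )
    return covered / len(examples)


# Toy data: the target is a 2x2 square; one candidate covers 3 of its 4 pixels.
gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
good = {(0, 0), (0, 1), (1, 0)}
bad = {(5, 5)}
rate = coverage_rate([([bad, good], gt), ([bad], gt)])
```

If a prediction is scored as correct only when the selected mark passes the same IoU threshold, this rate is an upper bound on the accuracy any mark-selection step can reach, which is why the referee treats it as the quantity that separates segmentation quality from prompting quality.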

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the experimental analysis without altering the core claims of the work.

Point-by-point responses
  1. Referee: [§4, RefCOCOg experiments] The coverage rate, i.e., the fraction of test examples for which at least one generated mark has IoU > 0.5 with the ground-truth referred object, is not reported. Because the segmenter receives no text query, any systematic failure to mark the target region directly caps SoM accuracy; without this statistic it is impossible to separate the contribution of SoM from the quality of the upstream segmentation.

    Authors: We agree that the coverage rate is a valuable diagnostic statistic that was omitted from the original submission. In the revised manuscript we will report this quantity for the RefCOCOg test set under the SEEM segmenter used in our main experiments. Our internal calculations show coverage exceeding 90 percent, indicating that the segmenter already supplies the referred object as a marked region for the large majority of examples; the remaining performance gap is therefore attributable to SoM’s ability to let GPT-4V select and reason over those marks. Adding this figure will make the separation between segmentation quality and prompting effectiveness explicit.

    revision: yes

  2. Referee: [§3 and §4.2] No ablation replaces SEEM/SAM with a deliberately weaker or query-agnostic segmenter (or with random marks) on the same RefCOCOg split. Such a control would quantify how much of the reported outperformance is attributable to SoM versus the assumption that off-the-shelf multi-granularity segmentation already supplies near-perfect candidate regions.

    Authors: The referee correctly notes that our current ablations compare different high-quality segmenters (SEEM versus SAM) and mark types but do not include a deliberately degraded or random-mark baseline on the identical RefCOCOg split. We will add this control in the revision. Specifically, we will report results when (i) the segmenter is replaced by a weaker off-the-shelf model and (ii) marks are assigned to random image patches instead of segmentation regions. We expect the random-mark condition to collapse performance, thereby confirming that the gains arise from the combination of semantically meaningful regions and SoM prompting rather than from segmentation quality alone.

    revision: yes
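
One way the random-mark control in (ii) might be generated is sketched below. The patch-size range, the fixed seed, and the function name `random_patch_marks` are all assumptions made for illustration; the rebuttal does not specify how random patches would be sampled.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def random_patch_marks(
    width: int,
    height: int,
    n_marks: int,
    seed: int = 0,
    min_frac: float = 0.1,
    max_frac: float = 0.5,
) -> List[Box]:
    """Sample n_marks rectangular patches that ignore the image content.

    Each side length is a random fraction of the image side, and every box is
    placed so that it lies fully inside the image.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    boxes = []
    for _ in range(n_marks):
        w = max(1, int(width * rng.uniform(min_frac, max_frac)))
        h = max(1, int(height * rng.uniform(min_frac, max_frac)))
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        boxes.append((x0, y0, x0 + w, y0 + h))
    return boxes


# Five content-agnostic patches for a 640x480 image.
boxes = random_patch_marks(640, 480, 5, seed=1)
```

Matching the number of random patches per image to the number of regions the real segmenter produces would keep the two conditions comparable, so that only the semantic quality of the regions differs.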

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical evaluation of an external prompting method.

Full rationale

The paper presents Set-of-Mark as a prompting technique that overlays marks from off-the-shelf segmenters (SEEM/SAM) onto images for GPT-4V input. All reported results, including the zero-shot RefCOCOg outperformance, are direct empirical measurements on standard benchmarks. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Any self-citations (e.g., to segmenter models) support tool usage and do not carry the central performance claim, which remains independently falsifiable via benchmark scores. This matches the default expectation of non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical prompting paper. No mathematical free parameters, axioms, or invented entities are introduced; the approach relies on existing segmentation models and the base capabilities of GPT-4V.

pith-pipeline@v0.9.0 · 5493 in / 1146 out tokens · 50954 ms · 2026-05-12T13:54:55.628216+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

    cs.AI 2026-04 accept novelty 8.0

    WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

  2. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  3. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

  4. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  5. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 7.0

    ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...

  6. Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies

    cs.HC 2026-05 unverdicted novelty 7.0

    SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.

  7. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  8. ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

    cs.CV 2026-05 unverdicted novelty 7.0

    ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

  9. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  10. FP-Agent: Fingerprinting AI Browsing Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.

  11. CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

  12. ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models

    cs.SE 2026-04 unverdicted novelty 7.0

    ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.

  13. Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.

  14. BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

    cs.CV 2026-04 unverdicted novelty 7.0

    By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.

  15. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  16. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  17. WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

    cs.CR 2026-04 unverdicted novelty 7.0

    WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.

  18. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  19. Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-05 conditional novelty 6.0

    GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

  20. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 6.0

    ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...

  21. AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.

  22. Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning

    cs.SE 2026-04 unverdicted novelty 6.0

    Proactive multi-window state triggering plus Set-of-Mark alignment and multimodal LLM reasoning detects GUI defects in Android apps, reporting 184% more text truncation, 87.2% F1 on occlusion, and 40 defect-prone apps...

  23. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  24. Long-Term Memory for VLA-based Agents in Open-World Task Execution

    cs.RO 2026-04 unverdicted novelty 6.0

    ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

  25. UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

    cs.CV 2026-04 unverdicted novelty 6.0

    UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.

  26. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  27. Text-Guided Multi-Scale Frequency Representation Adaptation

    cs.CV 2026-05 unverdicted novelty 5.0

    FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.

  28. HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.

  29. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  30. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  31. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  32. APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

    cs.SD 2026-04 unverdicted novelty 3.0

    A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...

  33. AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

    cs.CV 2026-04 unverdicted novelty 3.0

    An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 32 Pith papers · 11 internal anchors

  1. [1]

    Visual prompting via image inpainting

    Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005– 25017, 2022. 12

  2. [2]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  3. [3]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  4. [4]

    Videollm: Modeling video sequence with large language models, 2023

    Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, and Limin Wang. Videollm: Modeling video sequence with large language models, 2023

  5. [5]

    Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. github, 2023

  6. [6]

    Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023

  7. [7]

    Fleet, and Geoffrey Hinton

    Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection, 2022

  8. [8]

    Conditional diffusion for interactive segmentation

    Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, and Manni Duan. Conditional diffusion for interactive segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7345–7354, 2021

  9. [9]

    Fo- calclick: Towards practical interactive image segmentation

    Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. Fo- calclick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1300–1309, 2022

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  11. [11]

    Samaug: Point prompt augmentation for segment anything model

    Haixing Dai, Chong Ma, Zhengliang Liu, Yiwei Li, Peng Shu, Xiaozheng Wei, Lin Zhao, Zihao Wu, Dajiang Zhu, Wei Liu, et al. Samaug: Point prompt augmentation for segment anything model. arXiv preprint arXiv:2307.01187, 2023

  12. [12]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  13. [13]

    Pla: Language-driven open-vocabulary 3d scene understanding

    Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023

  14. [14]

    Open-vocabulary panoptic segmentation with maskclip

    Zheng Ding, Jieke Wang, and Zhuowen Tu. Open-vocabulary panoptic segmentation with maskclip. arXiv preprint arXiv:2208.08984, 2022

  15. [15]

    A Survey on In-context Learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022

  16. [16]

    Scaling open-vocabulary image segmen- tation with image-level labels

    Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmen- tation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022

  17. [17]

    Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921,

    Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021

  18. [18]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 787–798, 2014

  19. [19]

    Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. 13

  20. [20]

    Multimodal founda- tion models: From specialists to general-purpose assistants

    Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023

  21. [21]

    Semantic-sam: Segment and recognize anything at any granularity, 2023

    Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity, 2023

  22. [22]

    Semantic-sam: Segment and recognize anything at any granularity

    Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023

  23. [23]

    Ni, and Heung-Yeung Shum

    Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022

  24. [24]

    Mask dino: Towards a unified transformer-based framework for object detection and segmentation

    Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023

  25. [25]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022

  26. [26]

    Lawrence Zitnick, and Piotr Dollár

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015

  27. [27]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  28. [28]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  29. [29]

    Manmatha

    Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, and R. Manmatha. Polyformer: Referring image segmentation as sequential polygon generation, 2023

  30. [30]

    3d open-vocabulary segmentation with foundation models

    Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. 3d open-vocabulary segmentation with foundation models. arXiv preprint arXiv:2305.14093, 2023

  31. [31]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023

  32. [32]

    Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning

    Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning. arXiv preprint arXiv:2207.01987, 2022

  33. [33]

    Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023

    Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023

  34. [34]

    A comparative evaluation of interactive segmentation algorithms

    Kevin McGuinness and Noel E. O’Connor. A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2):434–444, 2010

  35. [35]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  36. [36]

    Gpt-4v(ision) system card, 2023

    OpenAI. Gpt-4v(ision) system card, 2023

  37. [37]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016

  38. [38]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016

    Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016

  39. [39]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017

  40. [40]

    Segment anything meets point tracking

    Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. arXiv preprint arXiv:2307.01197, 2023

  41. [41]

    What does clip know about a red circle? visual prompt engineering for vlms, 2023

    Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms, 2023

  42. [42]

    Can sam segment anything? when sam meets camouflaged object detection

    Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023

  43. [43]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  44. [44]

    Images speak in images: A generalist painter for in-context visual learning

    Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023

  45. [45]

    Seggpt: Segmenting everything in context, 2023

    Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context, 2023

  46. [46]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  47. [47]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

  48. [48]

    Fine-grained visual prompting, 2023

    Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, and Jian Yang. Fine-grained visual prompting, 2023

  49. [49]

    The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

  50. [50]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

  51. [51]

    Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection

    Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Advances in Neural Information Processing Systems, 35:9125–9138, 2022

  52. [52]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  53. [53]

    Cpt: Colorful prompt tuning for pre-trained vision-language models, 2022

    Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Cpt: Colorful prompt tuning for pre-trained vision-language models, 2022

  54. [54]

    Ferret: Refer and ground anything anywhere at any granularity, 2023

    Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity, 2023

  55. [55]

    A simple framework for open-vocabulary segmentation and detection, 2023

    Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection, 2023

  56. [56]

    Glipv2: Unifying localization and vision-language understanding, 2022

    Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding, 2022

  57. [57]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

  58. [58]

    Personalize segment anything model with one shot

    Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023

  59. [59]

    Gpt4roi: Instruction tuning large language model on region-of-interest, 2023

    Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest, 2023

  60. [60]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  61. [61]

    Regionclip: Region-based language-image pretraining

    Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022

  62. [62]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

  63. [63]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  64. [64]

    Generalized decoding for pixel, image, and language, 2022

    Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image, and language, 2022

  65. [65]

    Segment everything everywhere all at once

    Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023

    A Use Cases

    Combining GPT-4V with the proposed SoM prompting, we run many realistic use cases to check its generality and merit in comparison with the original GPT-4V. Grounded Reaso...

  66. [66]

    Fish Fillet: The main component of the dish appears to be a fillet of fish, likely grilled or pan-seared

  67. [67]

    Lemon Wedge: A common accompaniment with fish dishes to enhance flavor when squeezed over the top

  68. [68]

    Parsley or Cilantro: This green herb is used as a garnish and can add a fresh flavor to dishes

  69. [69]

    Asparagus: These green stalks are often steamed, sautéed, or grilled and served as a side

  70. [70]

    Sauce or Broth: There is a yellowish sauce or broth on the plate

  71. [71]

    Potato: It looks like a roasted or boiled potato

  72. [72]

    Garnish: These could be green onions or chives scattered on the plate to add color and flavor

  73. [73]

    The dish seems to be a well-balanced meal with a protein (fish), vegetables (asparagus and potato), and various flavor enhancers (sauces, lemon, and herbs)

    Sauce or Oil Drizzle: There’s a darker drizzle around the potato, which could be a type of sauce or perhaps a balsamic reduction. The dish seems to be a well-balanced meal with a protein (fish), vegetables (asparagus and potato), and various flavor enhancers (sauces, lemon, and herbs).

    Figure 11: Case study on an image of a dish for GPT-4V. GPT-4V does not...

  74. [74]

    Onion rings (1 and 11) - Fried foods like these are often high in unhealthy fats and calories

  75. [75]

    Fries (8) - Another fried item that can be calorie-dense and contain unhealthy fats

  76. [76]

    Burgers (3 and 5) - Depending on their composition, they can be high in calories, unhealthy fats, and may contain processed meats

  77. [77]

    Processed meats (2 and 15) - These can be high in sodium, unhealthy fats, and other additives

  78. [78]

    Chips (16) - Typically high in unhealthy fats and can be calorie-dense

    Chips (16) - Typically high in unhealthy fats and can be calorie-dense. While indulging occasionally is okay, it’s best to consume these items in moderation if you’re trying to maintain a healthy diet.

    Figure 12: SoM-prompted GPT-4V gives very precise suggestions while the original one fails, even with hallucinated foods, e.g., soft drinks.

    Tool Usage ...