arxiv: 2310.11441 · v2 · submitted 2023-10-17 · 💻 cs.CV · cs.AI · cs.CL · cs.HC

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Chunyuan Li, Feng Li, Hao Zhang, Jianfeng Gao, Jianwei Yang, Xueyan Zou

Pith reviewed 2026-05-12 13:54 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.HC

keywords Set-of-Mark prompting · visual grounding · GPT-4V · referring expression comprehension · zero-shot learning · multimodal models · image segmentation · SAM

The pith

Set-of-Mark prompting lets zero-shot GPT-4V ground visual references more accurately than fully fine-tuned models on RefCOCOg.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Set-of-Mark (SoM) prompting to strengthen the visual grounding of large multimodal models such as GPT-4V. Off-the-shelf segmentation models divide an image into regions at several levels of granularity, and each region is overlaid with a mark such as a number, a box, or a mask. The marked image is then given to GPT-4V, which can answer questions that require identifying or describing specific parts of the image. On the RefCOCOg benchmark, this zero-shot approach outperforms fully fine-tuned models for referring expression comprehension and segmentation. This matters because it suggests that general-purpose models can handle fine-grained visual tasks through input formatting alone, without specialized training.

Core claim

By partitioning an image into regions using interactive segmentation models like SEEM or SAM and overlaying a set of marks such as alphanumerics, masks, and boxes, GPT-4V can be guided to perform fine-grained visual grounding. This enables it to outperform the state-of-the-art fully-finetuned referring expression comprehension and segmentation models on RefCOCOg in a zero-shot setting.

What carries the argument

Set-of-Mark (SoM) prompting, where off-the-shelf segmentation partitions the image and marks are overlaid to make regions identifiable for the multimodal model.
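The marking step can be illustrated with a small sketch. This is a simplified stand-in, not the released implementation: it assumes each region arrives as a binary mask, numbers regions from smallest to largest area (an arbitrary choice made here), and places each label at the foreground pixel nearest the region centroid. The names `mask_pixels`, `mark_anchor`, and `assign_marks` are hypothetical.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1 values


def mask_pixels(mask: Mask) -> List[Tuple[int, int]]:
    """Return (row, col) coordinates of all foreground pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def mark_anchor(mask: Mask) -> Tuple[int, int]:
    """Pick a label position: the foreground pixel closest to the centroid.

    Snapping to a foreground pixel keeps the mark inside non-convex regions,
    where the raw centroid can fall outside the mask.
    """
    pixels = mask_pixels(mask)
    if not pixels:
        raise ValueError("empty mask")
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - cr) ** 2 + (p[1] - cc) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Number regions 1..N, smallest area first (an assumption of this sketch),
    and return a mapping from mark id to label anchor."""
    order = sorted(range(len(masks)), key=lambda i: len(mask_pixels(masks[i])))
    return {mark_id: mark_anchor(masks[i]) for mark_id, i in enumerate(order, start=1)}
```

A renderer would then draw each mark id at its anchor on the image before the image is sent to the multimodal model; the drawing step is omitted here because it depends on the imaging library in use.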

If this is right

GPT-4V with SoM outperforms SOTA fine-tuned models on RefCOCOg for referring expression comprehension and segmentation.
The method applies to a wide range of fine-grained vision and multimodal tasks without any fine-tuning of the LMM.
It relies on the quality of region marks generated by external segmentation models like SEEM and SAM.
The released code enables direct application to new images and queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This prompting style could extend to other large multimodal models to unlock similar grounding improvements.
Advances in segmentation accuracy would directly raise the ceiling for SoM-based performance.
The approach may lower the data and compute costs for deploying capable visual AI systems.
Applying similar marking to video frames or 3D scenes could test broader applicability.

Load-bearing premise

The marks overlaid on image regions must be sufficiently accurate and unambiguous for GPT-4V to correctly associate them with the described objects without confusion or hallucination.

What would settle it

A controlled test on RefCOCOg using verified accurate marks where GPT-4V with SoM fails to exceed the accuracy of the fine-tuned SOTA model or consistently selects the wrong mark for a referring expression.

Original abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
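Once the marked image has been shown to the model, its textual answer has to be mapped back to a region. The sketch below shows one way this could look. The prompt wording, the box layout, and the helper names `build_prompt`, `parse_mark`, and `ground` are assumptions made for illustration; the abstract does not specify the prompt or the parsing used in the paper, and no model API is called here.

```python
import re
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), hypothetical layout


def build_prompt(expression: str, mark_ids: List[int]) -> str:
    """Compose a text query that refers to the numeric marks drawn on the image."""
    ids = ", ".join(str(i) for i in sorted(mark_ids))
    return (
        f"The image is annotated with numeric marks: {ids}. "
        f'Which mark corresponds to: "{expression}"? '
        "Answer with the mark number only."
    )


def parse_mark(reply: str, valid_ids: List[int]) -> Optional[int]:
    """Return the first valid mark number mentioned in the reply, if any."""
    for token in re.findall(r"\d+", reply):
        if int(token) in valid_ids:
            return int(token)
    return None


def ground(reply: str, boxes: Dict[int, Box]) -> Optional[Box]:
    """Map the model's textual answer back to the box of the chosen mark."""
    mark = parse_mark(reply, list(boxes))
    return boxes.get(mark) if mark is not None else None
```

Returning `None` when no valid mark is found keeps unparseable or off-list answers from being silently counted as a prediction.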

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Set-of-Mark (SoM) prompting, a visual prompting technique that uses off-the-shelf segmentation models (e.g., SEEM or SAM) to partition an input image into regions at multiple granularities and overlays these regions with marks such as alphanumerics, masks, or boxes. The marked image is then provided to GPT-4V (or similar LMMs) to enable zero-shot visual grounding on fine-grained tasks including referring expression comprehension, segmentation, and other vision-language benchmarks. The central empirical claim is that GPT-4V with SoM outperforms the prior state-of-the-art fully fine-tuned models on RefCOCOg while requiring no task-specific training; the authors also report results across a range of additional tasks and release the code publicly.

Significance. If the reported gains hold under rigorous controls, the work is significant because it shows that a simple, training-free visual prompting strategy can unlock strong grounding performance in frontier LMMs, reducing reliance on expensive fine-tuning. The public code release is a clear strength that supports reproducibility and extension by the community.

major comments (2)

[§4] §4 (RefCOCOg experiments): the coverage rate—i.e., the fraction of test examples for which at least one generated mark has IoU > 0.5 with the ground-truth referred object—is not reported. Because the segmenter receives no text query, any systematic failure to mark the target region directly caps SoM accuracy; without this statistic it is impossible to separate the contribution of SoM from the quality of the upstream segmentation.
[§3 and §4.2] §3 and §4.2: no ablation replaces SEEM/SAM with a deliberately weaker or query-agnostic segmenter (or with random marks) on the same RefCOCOg split. Such a control would quantify how much of the reported outperformance is attributable to SoM versus the assumption that off-the-shelf multi-granularity segmentation already supplies near-perfect candidate regions.
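The coverage rate requested in the first major comment can be computed as sketched below: an example counts as covered when at least one candidate mark has IoU strictly greater than 0.5 with the ground-truth mask. Masks are represented as sets of pixel coordinates purely for brevity; that representation and the function names are assumptions of this sketch.

```python
from typing import List, Set, Tuple

Pixels = Set[Tuple[int, int]]  # a mask stored as a set of (row, col) pixels


def iou(a: Pixels, b: Pixels) -> float:
    """Intersection over union of two masks; 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_covered(candidates: List[Pixels], target: Pixels, threshold: float = 0.5) -> bool:
    """True if any candidate mark overlaps the target with IoU above the threshold."""
    return any(iou(mask, target) > threshold for mask in candidates)


def coverage_rate(
    examples: List[Tuple[List[Pixels], Pixels]], threshold: float = 0.5
) -> float:
    """Fraction of (candidates, target) examples in which the target is covered."""
    if not examples:
        raise ValueError("no examples")
    return sum(is_covered(c, t, threshold) for c, t in examples) / len(examples)
```

Under an IoU-thresholded metric, an uncovered example cannot be answered correctly no matter which mark the model selects, which is why this statistic acts as a ceiling on accuracy.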

minor comments (2)

[Figure 1] Figure 1 and the accompanying caption would benefit from explicit enumeration of the mark types (alphanumeric, mask, box) and their visual encoding so readers can immediately map the illustration to the method description.
[§3] The paper should state the exact version and checkpoint of SEEM/SAM used for all reported numbers, together with any post-processing (e.g., non-maximum suppression) applied to the generated marks.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the experimental analysis without altering the core claims of the work.

Point-by-point responses

Referee: [§4] §4 (RefCOCOg experiments): the coverage rate—i.e., the fraction of test examples for which at least one generated mark has IoU > 0.5 with the ground-truth referred object—is not reported. Because the segmenter receives no text query, any systematic failure to mark the target region directly caps SoM accuracy; without this statistic it is impossible to separate the contribution of SoM from the quality of the upstream segmentation.

Authors: We agree that the coverage rate is a valuable diagnostic statistic that was omitted from the original submission. In the revised manuscript we will report this quantity for the RefCOCOg test set under the SEEM segmenter used in our main experiments. Our internal calculations show coverage exceeding 90 percent, indicating that the segmenter already supplies the referred object as a marked region for the large majority of examples; the remaining performance gap is therefore attributable to SoM’s ability to let GPT-4V select and reason over those marks. Adding this figure will make the separation between segmentation quality and prompting effectiveness explicit.

revision: yes

Referee: [§3 and §4.2] §3 and §4.2: no ablation replaces SEEM/SAM with a deliberately weaker or query-agnostic segmenter (or with random marks) on the same RefCOCOg split. Such a control would quantify how much of the reported outperformance is attributable to SoM versus the assumption that off-the-shelf multi-granularity segmentation already supplies near-perfect candidate regions.

Authors: The referee correctly notes that our current ablations compare different high-quality segmenters (SEEM versus SAM) and mark types but do not include a deliberately degraded or random-mark baseline on the identical RefCOCOg split. We will add this control in the revision. Specifically, we will report results when (i) the segmenter is replaced by a weaker off-the-shelf model and (ii) marks are assigned to random image patches instead of segmentation regions. We expect the random-mark condition to collapse performance, thereby confirming that the gains arise from the combination of semantically meaningful regions and SoM prompting rather than from segmentation quality alone.

revision: yes
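The random-mark control in (ii) could be generated along the following lines. The rebuttal does not say how the patches would be sampled, so the uniform sampling, the `min_size` parameter, and the function name `random_patch_marks` are all hypothetical choices made for this sketch.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def random_patch_marks(
    width: int, height: int, n_marks: int, min_size: int = 8, seed: int = 0
) -> List[Box]:
    """Sample n_marks axis-aligned patches uniformly inside a width x height image.

    The patches ignore image content entirely, so they serve as a control for
    marks that carry no segmentation information.
    """
    if min_size > width or min_size > height:
        raise ValueError("min_size exceeds image dimensions")
    rng = random.Random(seed)  # seeded so the control condition is reproducible
    boxes: List[Box] = []
    for _ in range(n_marks):
        w = rng.randint(min_size, width)
        h = rng.randint(min_size, height)
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        boxes.append((x0, y0, x0 + w, y0 + h))
    return boxes
```

Matching `n_marks` to the number of regions the segmenter produced for the same image would keep the visual clutter comparable between the two conditions.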

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical evaluation of an external prompting method.

The paper presents Set-of-Mark as a prompting technique that overlays marks from off-the-shelf segmenters (SEEM/SAM) onto images for GPT-4V input. All reported results, including the zero-shot RefCOCOg outperformance, are direct empirical measurements on standard benchmarks. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Any self-citations (e.g., to segmenter models) support tool usage rather than carrying the central performance claim, which remains independently falsifiable via benchmark scores. This matches the default expectation of non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical prompting paper. No mathematical free parameters, axioms, or invented entities are introduced; the approach relies on existing segmentation models and the base capabilities of GPT-4V.

pith-pipeline@v0.9.0 · 5493 in / 1146 out tokens · 50954 ms · 2026-05-12T13:54:55.628216+00:00 · methodology

Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
cs.AI 2026-04 accept novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 7.0

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies
cs.HC 2026-05 unverdicted novelty 7.0

SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
cs.CV 2026-05 unverdicted novelty 7.0

ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
FP-Agent: Fingerprinting AI Browsing Agents
cs.CR 2026-05 unverdicted novelty 7.0

Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 7.0

CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.
ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models
cs.SE 2026-04 unverdicted novelty 7.0

ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.
Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations
cs.CV 2026-04 unverdicted novelty 7.0

Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
cs.CV 2026-04 unverdicted novelty 7.0

By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV 2026-04 unverdicted novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
cs.CR 2026-04 unverdicted novelty 7.0

WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
cs.AI 2024-05 accept novelty 7.0

AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-05 conditional novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 6.0

ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
cs.HC 2026-04 unverdicted novelty 6.0

AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
cs.SE 2026-04 unverdicted novelty 6.0

Proactive multi-window state triggering plus Set-of-Mark alignment and multimodal LLM reasoning detects GUI defects in Android apps, reporting 184% more text truncation, 87.2% F1 on occlusion, and 40 defect-prone apps...
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
cs.RO 2026-04 unverdicted novelty 6.0

COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...
Long-Term Memory for VLA-based Agents in Open-World Task Execution
cs.RO 2026-04 unverdicted novelty 6.0

ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
cs.CV 2026-04 unverdicted novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Text-Guided Multi-Scale Frequency Representation Adaptation
cs.CV 2026-05 unverdicted novelty 5.0

FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
cs.CV 2026-04 unverdicted novelty 5.0

HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.
LLaVA-OneVision: Easy Visual Task Transfer
cs.CV 2024-08 unverdicted novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 4.0

UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
cs.SD 2026-04 unverdicted novelty 3.0

A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
cs.CV 2026-04 unverdicted novelty 3.0

An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 32 Pith papers · 11 internal anchors

[1]

Visual prompting via image inpainting

Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005– 25017, 2022. 12

work page 2022
[2]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page 2020
[3]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Videollm: Modeling video sequence with large language models, 2023

Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, and Limin Wang. Videollm: Modeling video sequence with large language models, 2023

work page 2023
[5]

Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. github, 2023

work page 2023
[6]

Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023

work page 2023
[7]

Fleet, and Geoffrey Hinton

Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection, 2022

work page 2022
[8]

Conditional diffusion for interactive segmentation

Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, and Manni Duan. Conditional diffusion for interactive segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7345–7354, 2021

work page 2021
[9]

Fo- calclick: Towards practical interactive image segmentation

Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. Fo- calclick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1300–1309, 2022

work page 2022
[10]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Samaug: Point prompt augmentation for segment anything model

Haixing Dai, Chong Ma, Zhengliang Liu, Yiwei Li, Peng Shu, Xiaozheng Wei, Lin Zhao, Zihao Wu, Dajiang Zhu, Wei Liu, et al. Samaug: Point prompt augmentation for segment anything model. arXiv preprint arXiv:2307.01187, 2023

work page arXiv 2023
[12]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

work page 2023
[13]

Pla: Language-driven open-vocabulary 3d scene understanding

Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023

work page 2023
[14]

Open-vocabulary panoptic segmentation with maskclip

Zheng Ding, Jieke Wang, and Zhuowen Tu. Open-vocabulary panoptic segmentation with maskclip. arXiv preprint arXiv:2208.08984, 2022

work page arXiv 2022
[15]

A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[16]

Scaling open-vocabulary image segmen- tation with image-level labels

Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmen- tation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022

work page 2022
[17]

Open-vocabulary object detection via vision and language knowledge distillation.arXiv preprint arXiv:2104.13921,

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021

work page arXiv 2021
[18]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 787–798, 2014

work page 2014
[19]

Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. 13

work page 2023
[20]

Multimodal founda- tion models: From specialists to general-purpose assistants

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023

work page arXiv 2023
[21]

Semantic-sam: Segment and recognize anything at any granularity, 2023

Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity, 2023

work page 2023
[22]

Semantic-sam: Segment and recognize anything at any granularity

Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023

work page arXiv 2023
[23]

Ni, and Heung-Yeung Shum

Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022

work page 2022
[24]

Mask dino: Towards a unified transformer-based framework for object detection and segmentation

Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023

work page 2023
[25]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022

work page 2022
[26]

Lawrence Zitnick, and Piotr Dollár

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015

work page 2015
[27]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

work page 2023
[28]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Manmatha

Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, and R. Manmatha. Polyformer: Referring image segmentation as sequential polygon generation, 2023

work page 2023
[30]

3d open-vocabulary segmentation with foundation models

Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. 3d open-vocabulary segmentation with foundation models. arXiv preprint arXiv:2305.14093, 2023

work page arXiv 2023
[31]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023

work page 2023
[32]

Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning

Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning. arXiv preprint arXiv:2207.01987, 2022

work page arXiv 2022
[33]

Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023

work page 2023
[34]

A comparative evaluation of interactive segmentation algorithms

Kevin McGuinness and Noel E O’connor. A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2):434–444, 2010

work page 2010
[35]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

[36]

OpenAI. Gpt-4v(ision) system card, 2023

[37]

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016

[38]

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016

[39]

Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017

[40]

Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. arXiv preprint arXiv:2307.01197, 2023

[41]

Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms, 2023

[42]

Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023

[43]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[44]

Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023

[45]

Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context, 2023

[46]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems , 35:24824–24837, 2022

[47]

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

[48]

Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, and Jian Yang. Fine-grained visual prompting, 2023

[49]

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023

[50]

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

[51]

Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Advances in Neural Information Processing Systems , 35:9125–9138, 2022

[52]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

[53]

Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Cpt: Colorful prompt tuning for pre-trained vision-language models, 2022

[54]

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity, 2023

[55]

Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection, 2023

[56]

Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding, 2022

[57]

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023

[58]

Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023

[59]

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest, 2023

[60]

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

[61]

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022

[62]

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

[63]

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

[64]

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image, and language, 2022

[65]

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023

A Use Cases

Combining GPT-4V with the proposed SoM prompting, we run many realistic use cases to check its generality and merit in comparison with the original GPT-4V. Grounded Reaso...

Fish Fillet: The main component of the dish appears to be a fillet of fish, likely grilled or pan-seared.

Lemon Wedge: A common accompaniment with fish dishes to enhance flavor when squeezed over the top.

Parsley or Cilantro: This green herb is used as a garnish and can add a fresh flavor to dishes.

Asparagus: These green stalks are often steamed, sautéed, or grilled and served as a side.

Sauce or Broth: There is a yellowish sauce or broth on the plate.

Potato: It looks like a roasted or boiled potato.

Garnish: These could be green onions or chives scattered on the plate to add color and flavor.

Sauce or Oil Drizzle: There’s a darker drizzle around the potato, which could be a type of sauce or perhaps a balsamic reduction.

The dish seems to be a well-balanced meal with a protein (fish), vegetables (asparagus and potato), and various flavor enhancers (sauces, lemon, and herbs).

Figure 11: Case study on an image of dish for GPT-4V. GPT-4V does not...

Onion rings (1 and 11) - Fried foods like these are often high in unhealthy fats and calories.

Fries (8) - Another fried item that can be calorie-dense and contain unhealthy fats.

Burgers (3 and 5) - Depending on their composition, they can be high in calories, unhealthy fats, and may contain processed meats.

Processed meats (2 and 15) - These can be high in sodium, unhealthy fats, and other additives.

Chips (16) - Typically high in unhealthy fats and can be calorie-dense.

While indulging occasionally is okay, it’s best to consume these items in moderation if you’re trying to maintain a healthy diet.

Figure 12: SoM-prompted GPT-4V gives very precise suggestions while the original one fails, even with hallucinated foods, e.g., soft drinks.

Tool Usage ...