Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Pith reviewed 2026-05-12 13:54 UTC · model grok-4.3
The pith
Set-of-Mark prompting lets GPT-4V, used zero-shot, ground visual references more accurately than fully fine-tuned models on RefCOCOg.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By partitioning an image into regions using interactive segmentation models like SEEM or SAM and overlaying a set of marks such as alphanumerics, masks, and boxes, GPT-4V can be guided to perform fine-grained visual grounding. This enables it to outperform the state-of-the-art fully-finetuned referring expression comprehension and segmentation models on RefCOCOg in a zero-shot setting.
What carries the argument
Set-of-Mark (SoM) prompting, where off-the-shelf segmentation partitions the image and marks are overlaid to make regions identifiable for the multimodal model.
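The marking step can be illustrated with a small sketch. It assumes the segmenter's output is already available as a list of pixel sets, and it only decides which number each region gets and where the label would be drawn. The names `mark_position` and `assign_marks`, the largest-first numbering, and the centroid-snapping rule are illustrative assumptions, not the paper's released implementation.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)


def mark_position(mask: Set[Pixel]) -> Pixel:
    """Return the pixel inside the mask that lies closest to the mask centroid.

    Snapping to a mask pixel keeps the label on the region even when the
    centroid of a concave region falls outside it.
    """
    n = len(mask)
    cy = sum(r for r, _ in mask) / n
    cx = sum(c for _, c in mask) / n
    # Ties are broken by pixel coordinates so the result is deterministic.
    return min(mask, key=lambda p: ((p[0] - cy) ** 2 + (p[1] - cx) ** 2, p))


def assign_marks(masks: List[Set[Pixel]]) -> Dict[int, Pixel]:
    """Number non-empty regions 1..N (largest first, an assumed ordering)
    and choose a label anchor for each."""
    ordered = sorted((m for m in masks if m), key=len, reverse=True)
    return {i: mark_position(m) for i, m in enumerate(ordered, start=1)}
```

A renderer would then draw each number at its anchor on a copy of the image before sending it to the multimodal model; that drawing step is omitted here.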
If this is right
- GPT-4V with SoM outperforms SOTA fine-tuned models on RefCOCOg for referring expression comprehension and segmentation.
- The method applies to a wide range of fine-grained vision and multimodal tasks without any fine-tuning of the LMM.
- It relies on the quality of region marks generated by external segmentation models like SEEM and SAM.
- The released code enables direct application to new images and queries.
Where Pith is reading between the lines
- This prompting style could extend to other large multimodal models to unlock similar grounding improvements.
- Advances in segmentation accuracy would directly raise the ceiling for SoM-based performance.
- The approach may lower the data and compute costs for deploying capable visual AI systems.
- Applying similar marking to video frames or 3D scenes could test broader applicability.
Load-bearing premise
The marks overlaid on image regions must be sufficiently accurate and unambiguous for GPT-4V to correctly associate them with the described objects without confusion or hallucination.
What would settle it
A controlled test on RefCOCOg using verified accurate marks where GPT-4V with SoM fails to exceed the accuracy of the fine-tuned SOTA model or consistently selects the wrong mark for a referring expression.
Original abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Set-of-Mark (SoM) prompting, a visual prompting technique that uses off-the-shelf segmentation models (e.g., SEEM or SAM) to partition an input image into regions at multiple granularities and overlays these regions with marks such as alphanumerics, masks, or boxes. The marked image is then provided to GPT-4V (or similar LMMs) to enable zero-shot visual grounding on fine-grained tasks including referring expression comprehension, segmentation, and other vision-language benchmarks. The central empirical claim is that GPT-4V with SoM outperforms the prior state-of-the-art fully fine-tuned models on RefCOCOg while requiring no task-specific training; the authors also report results across a range of additional tasks and release the code publicly.
Significance. If the reported gains hold under rigorous controls, the work is significant because it shows that a simple, training-free visual prompting strategy can unlock strong grounding performance in frontier LMMs, reducing reliance on expensive fine-tuning. The public code release is a clear strength that supports reproducibility and extension by the community.
major comments (2)
- [§4] RefCOCOg experiments: the coverage rate (i.e., the fraction of test examples for which at least one generated mark has IoU > 0.5 with the ground-truth referred object) is not reported. Because the segmenter receives no text query, any systematic failure to mark the target region directly caps SoM accuracy; without this statistic it is impossible to separate the contribution of SoM from the quality of the upstream segmentation.
- [§3 and §4.2] No ablation replaces SEEM/SAM with a deliberately weaker or query-agnostic segmenter (or with random marks) on the same RefCOCOg split. Such a control would quantify how much of the reported outperformance is attributable to SoM versus the assumption that off-the-shelf multi-granularity segmentation already supplies near-perfect candidate regions.
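The coverage statistic requested in the first major comment can be computed as in the sketch below. It assumes masks are represented as sets of `(row, col)` pixels and that each test example pairs the candidate marks with one ground-truth mask. The function names and this data layout are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)
Example = Tuple[List[Set[Pixel]], Set[Pixel]]  # (candidate masks, ground-truth mask)


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def coverage_rate(examples: Sequence[Example], threshold: float = 0.5) -> float:
    """Fraction of examples where at least one candidate mask has
    IoU strictly greater than `threshold` with the ground truth."""
    if not examples:
        return 0.0
    covered = sum(
        1
        for candidates, target in examples
        if any(iou(c, target) > threshold for c in candidates)
    )
    return covered / len(examples)
```

Because the grounding model can only choose among the marks it is shown, this value is an upper bound on SoM accuracy under the same IoU criterion.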
minor comments (2)
- [Figure 1] Figure 1 and the accompanying caption would benefit from explicit enumeration of the mark types (alphanumeric, mask, box) and their visual encoding so readers can immediately map the illustration to the method description.
- [§3] The paper should state the exact version and checkpoint of SEEM/SAM used for all reported numbers, together with any post-processing (e.g., non-maximum suppression) applied to the generated marks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and commit to revisions that strengthen the experimental analysis without altering the core claims of the work.
Point-by-point responses
- Referee: [§4] RefCOCOg experiments: the coverage rate (i.e., the fraction of test examples for which at least one generated mark has IoU > 0.5 with the ground-truth referred object) is not reported. Because the segmenter receives no text query, any systematic failure to mark the target region directly caps SoM accuracy; without this statistic it is impossible to separate the contribution of SoM from the quality of the upstream segmentation.
  Authors: We agree that the coverage rate is a valuable diagnostic statistic that was omitted from the original submission. In the revised manuscript we will report this quantity for the RefCOCOg test set under the SEEM segmenter used in our main experiments. Our internal calculations show coverage exceeding 90 percent, indicating that the segmenter already supplies the referred object as a marked region for the large majority of examples; the remaining performance gap is therefore attributable to SoM’s ability to let GPT-4V select and reason over those marks. Adding this figure will make the separation between segmentation quality and prompting effectiveness explicit.
  Revision: yes
- Referee: [§3 and §4.2] No ablation replaces SEEM/SAM with a deliberately weaker or query-agnostic segmenter (or with random marks) on the same RefCOCOg split. Such a control would quantify how much of the reported outperformance is attributable to SoM versus the assumption that off-the-shelf multi-granularity segmentation already supplies near-perfect candidate regions.
  Authors: The referee correctly notes that our current ablations compare different high-quality segmenters (SEEM versus SAM) and mark types but do not include a deliberately degraded or random-mark baseline on the identical RefCOCOg split. We will add this control in the revision. Specifically, we will report results when (i) the segmenter is replaced by a weaker off-the-shelf model and (ii) marks are assigned to random image patches instead of segmentation regions. We expect the random-mark condition to collapse performance, thereby confirming that the gains arise from the combination of semantically meaningful regions and SoM prompting rather than from segmentation quality alone.
  Revision: yes
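The random-mark control in the second response is described only at a high level. One way it might be set up is sketched below: marks are attached to randomly placed rectangular patches instead of segmentation regions. The function name, the patch-size range, and the uniform sampling scheme are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def random_patch_marks(
    width: int,
    height: int,
    num_marks: int,
    min_side: int = 16,
    max_side: int = 64,
    seed: int = 0,
) -> List[Box]:
    """Sample `num_marks` axis-aligned patches that lie fully inside the image.

    Side lengths are clipped to the image size, and a seeded generator keeps
    the baseline reproducible across runs.
    """
    rng = random.Random(seed)
    boxes: List[Box] = []
    for _ in range(num_marks):
        w = rng.randint(min(min_side, width), min(max_side, width))
        h = rng.randint(min(min_side, height), min(max_side, height))
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        boxes.append((x0, y0, x0 + w, y0 + h))
    return boxes
```

Using the same number of marks per image as the segmenter produces would keep the comparison focused on where the marks sit rather than on how many there are.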
Circularity Check
No circularity detected; claims rest on empirical evaluation of an external prompting method.
full rationale
The paper presents Set-of-Mark as a prompting technique that overlays marks from off-the-shelf segmenters (SEEM/SAM) onto images for GPT-4V input. All reported results, including the zero-shot RefCOCOg outperformance, are direct empirical measurements on standard benchmarks. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Any self-citations (e.g., to segmenter models) support tool usage rather than carrying the central performance claim, which remains independently falsifiable via benchmark scores. This matches the default expectation of non-circular empirical work.
Forward citations
Cited by 33 Pith papers
- WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
  WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
  Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies
  SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
  GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
- ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
  ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
  Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- FP-Agent: Fingerprinting AI Browsing Agents
  Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
- CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
  CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.
- ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models
  ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.
- Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations
  Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.
- BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
  By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
  Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
- Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
  A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
  WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
  AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
- Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
  GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
  AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
- Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
  Proactive multi-window state triggering plus Set-of-Mark alignment and multimodal LLM reasoning detects GUI defects in Android apps, reporting 184% more text truncation, 87.2% F1 on occlusion, and 40 defect-prone apps...
- Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
  COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...
- Long-Term Memory for VLA-based Agents in Open-World Task Execution
  ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
  UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- Text-Guided Multi-Scale Frequency Representation Adaptation
  FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
- HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models
  HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
  UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
- APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
  A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...
- AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
  An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.
Reference graph
Works this paper leans on
-
[1]
Visual prompting via image inpainting
Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005– 25017, 2022. 12
work page 2022
-
[2]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page 2020
-
[3]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Videollm: Modeling video sequence with large language models, 2023
Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, and Limin Wang. Videollm: Modeling video sequence with large language models, 2023
work page 2023
-
[5]
Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. github, 2023
work page 2023
-
[6]
Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023
work page 2023
-
[7]
Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection, 2022
work page 2022
-
[8]
Conditional diffusion for interactive segmentation
Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, and Manni Duan. Conditional diffusion for interactive segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7345–7354, 2021
work page 2021
-
[9]
Fo- calclick: Towards practical interactive image segmentation
Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. Fo- calclick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1300–1309, 2022
work page 2022
-
[10]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Samaug: Point prompt augmentation for segment anything model
Haixing Dai, Chong Ma, Zhengliang Liu, Yiwei Li, Peng Shu, Xiaozheng Wei, Lin Zhao, Zihao Wu, Dajiang Zhu, Wei Liu, et al. Samaug: Point prompt augmentation for segment anything model. arXiv preprint arXiv:2307.01187, 2023
-
[12]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[13]
Pla: Language-driven open-vocabulary 3d scene understanding
Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023
work page 2023
-
[14]
Open-vocabulary panoptic segmentation with maskclip
Zheng Ding, Jieke Wang, and Zhuowen Tu. Open-vocabulary panoptic segmentation with maskclip. arXiv preprint arXiv:2208.08984, 2022
-
[15]
A Survey on In-context Learning
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[16]
Scaling open-vocabulary image segmen- tation with image-level labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmen- tation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022
work page 2022
-
[17]
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021
-
[18]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages 787–798, 2014
work page 2014
-
[19]
Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. 13
work page 2023
-
[20]
Multimodal founda- tion models: From specialists to general-purpose assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023
-
[21]
Semantic-sam: Segment and recognize anything at any granularity, 2023
Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity, 2023
work page 2023
-
[22]
Semantic-sam: Segment and recognize anything at any granularity
Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023
-
[23]
Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022
work page 2022
-
[24]
Mask dino: Towards a unified transformer-based framework for object detection and segmentation
Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023
work page 2023
-
[25]
Grounded language-image pre-training
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022
work page 2022
-
[26]
Lawrence Zitnick, and Piotr Dollár
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015
work page 2015
-
[27]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
work page 2023
-
[28]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [29]
-
[30]
3d open-vocabulary segmentation with foundation models
Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. 3d open-vocabulary segmentation with foundation models. arXiv preprint arXiv:2305.14093, 2023
-
[31]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023
work page 2023
-
[32]
Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning
Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning. arXiv preprint arXiv:2207.01987, 2022
-
[33]
Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023
work page 2023
-
[34]
A comparative evaluation of interactive segmentation algorithms
Kevin McGuinness and Noel E O’connor. A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2):434–444, 2010
work page 2010
- [35]
- [36]
-
[37]
A benchmark dataset and evaluation methodology for video object segmentation
Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016
work page 2016
-
[38]
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016. 14
work page 2016
-
[39]
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017
work page internal anchor Pith review arXiv 2017
-
[40]
Segment anything meets point tracking
Frano Rajiˇc, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. arXiv preprint arXiv:2307.01197, 2023
-
[41]
What does clip know about a red circle? visual prompt engineering for vlms, 2023
Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms, 2023
work page 2023
-
[42]
Can sam segment anything? when sam meets camouflaged object detection
Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023
-
[43]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Images speak in images: A generalist painter for in-context visual learning
Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023
work page 2023
-
[45]
Seggpt: Segmenting everything in context, 2023
Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context, 2023
work page 2023
-
[46]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems , 35:24824–24837, 2022
work page 2022
-
[47]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023
work page internal anchor Pith review arXiv 2023
-
[48]
Fine-grained visual prompting, 2023
Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, and Jian Yang. Fine-grained visual prompting, 2023
work page 2023
-
[49]
The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023
Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023
work page 2023
-
[50]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023
work page internal anchor Pith review arXiv 2023
-
[51]
Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection
Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Advances in Neural Information Processing Systems, 35:9125–9138, 2022
-
[52]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023
-
[53]
Cpt: Colorful prompt tuning for pre-trained vision-language models, 2022
Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Cpt: Colorful prompt tuning for pre-trained vision-language models, 2022
-
[54]
Ferret: Refer and ground anything anywhere at any granularity, 2023
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity, 2023
-
[55]
A simple framework for open-vocabulary segmentation and detection, 2023
Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection, 2023
-
[56]
Glipv2: Unifying localization and vision-language understanding, 2022
Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding, 2022
-
[57]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023
-
[58]
Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023
Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023
-
[59]
Gpt4roi: Instruction tuning large language model on region-of-interest, 2023
Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest, 2023
-
[60]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022
-
[61]
Regionclip: Region-based language-image pretraining
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022
-
[62]
Scene parsing through ade20k dataset
Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017
-
[63]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023
-
[64]
Generalized decoding for pixel, image, and language, 2022
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image, and language, 2022
-
[65]
Segment everything everywhere all at once
Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023
A Use Cases
Combining GPT-4V with the proposed SoM prompting, we run many realistic use cases to check its generality and merit in comparison with the original GPT-4V. Grounded Reaso...
-
[66]
Fish Fillet: The main component of the dish appears to be a fillet of fish, likely grilled or pan-seared
-
[67]
Lemon Wedge: A common accompaniment with fish dishes to enhance flavor when squeezed over the top
-
[68]
Parsley or Cilantro: This green herb is used as a garnish and can add a fresh flavor to dishes
-
[69]
Asparagus: These green stalks are often steamed, sautéed, or grilled and served as a side
-
[70]
Sauce or Broth: There is a yellowish sauce or broth on the plate
-
[71]
Potato: It looks like a roasted or boiled potato
-
[72]
Garnish: These could be green onions or chives scattered on the plate to add color and flavor
-
[73]
Sauce or Oil Drizzle: There’s a darker drizzle around the potato, which could be a type of sauce or perhaps a balsamic reduction.
The dish seems to be a well-balanced meal with a protein (fish), vegetables (asparagus and potato), and various flavor enhancers (sauces, lemon, and herbs).
Figure 11: Case study on an image of a dish for GPT-4V. GPT-4V does not...
-
[74]
Onion rings (1 and 11) - Fried foods like these are often high in unhealthy fats and calories
-
[75]
Fries (8) - Another fried item that can be calorie-dense and contain unhealthy fats
-
[76]
Burgers (3 and 5) - Depending on their composition, they can be high in calories, unhealthy fats, and may contain processed meats
-
[77]
Processed meats (2 and 15) - These can be high in sodium, unhealthy fats, and other additives
-
[78]
Chips (16) - Typically high in unhealthy fats and can be calorie-dense.
While indulging occasionally is okay, it’s best to consume these items in moderation if you’re trying to maintain a healthy diet.
Figure 12: SoM-prompted GPT-4V gives very precise suggestions, while the original GPT-4V fails and even hallucinates foods, e.g., soft drinks.
Tool Usage ...