hub

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao · 2023 · cs.CV · arXiv 2310.11441

32 Pith papers cite this work. Polarity classification is still indexing.

32 Pith papers citing it

open full Pith review browse 32 citing papers arXiv PDF

abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient representations rather than lack of utility.

Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies

cs.HC · 2026-05-11 · unverdicted · novelty 7.0

SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

FP-Agent: Fingerprinting AI Browsing Agents

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

cs.RO · 2026-04-24 · unverdicted · novelty 7.0

CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.

Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

cs.CR · 2026-04-07 · unverdicted · novelty 7.0

WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

cs.AI · 2024-05-23 · accept · novelty 7.0

AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-05-13 · conditional · novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.

AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

cs.HC · 2026-04-22 · unverdicted · novelty 6.0

AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.

Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning

cs.SE · 2026-04-21 · unverdicted · novelty 6.0

Proactive multi-window state triggering plus Set-of-Mark alignment and multimodal LLM reasoning detects GUI defects in Android apps, reporting 184% more text truncation, 87.2% F1 on occlusion, and 40 defect-prone apps at 10% FPR.

Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

cs.RO · 2026-04-18 · unverdicted · novelty 6.0

COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning due to gaps between vision and execution.

Long-Term Memory for VLA-based Agents in Open-World Task Execution

cs.RO · 2026-04-17 · unverdicted · novelty 6.0

ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.

citing papers explorer

Showing 32 of 32 citing papers.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments cs.AI · 2026-04-30 · accept · none · ref 5 · internal anchor
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 57 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation cs.CV · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction cs.CL · 2026-05-11 · unverdicted · none · ref 25 · 2 links · internal anchor
ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient representations rather than lack of utility.
Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies cs.HC · 2026-05-11 · unverdicted · none · ref 57 · internal anchor
SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning cs.CV · 2026-05-08 · unverdicted · none · ref 37 · internal anchor
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring cs.CV · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 43 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
FP-Agent: Fingerprinting AI Browsing Agents cs.CR · 2026-05-02 · unverdicted · none · ref 114 · internal anchor
Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models cs.RO · 2026-04-24 · unverdicted · none · ref 36 · internal anchor
CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.
ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models cs.SE · 2026-04-21 · unverdicted · none · ref 71 · internal anchor
ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.
Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations cs.CV · 2026-04-15 · unverdicted · none · ref 1 · internal anchor
Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning cs.CV · 2026-04-13 · unverdicted · none · ref 39 · internal anchor
By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 20 · internal anchor
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning cs.CV · 2026-04-08 · unverdicted · none · ref 56 · internal anchor
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks cs.CR · 2026-04-07 · unverdicted · none · ref 52 · internal anchor
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents cs.AI · 2024-05-23 · accept · none · ref 35 · internal anchor
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 34 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents cs.HC · 2026-04-22 · unverdicted · none · ref 54 · internal anchor
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning cs.SE · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
Proactive multi-window state triggering plus Set-of-Mark alignment and multimodal LLM reasoning detects GUI defects in Android apps, reporting 184% more text truncation, 87.2% F1 on occlusion, and 40 defect-prone apps at 10% FPR.
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction cs.RO · 2026-04-18 · unverdicted · none · ref 15 · internal anchor
COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning due to gaps between vision and execution.
Long-Term Memory for VLA-based Agents in Open-World Task Execution cs.RO · 2026-04-17 · unverdicted · none · ref 36 · internal anchor
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding cs.CV · 2026-04-15 · unverdicted · none · ref 24 · internal anchor
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 128 · internal anchor
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Text-Guided Multi-Scale Frequency Representation Adaptation cs.CV · 2026-05-05 · unverdicted · none · ref 33 · internal anchor
FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
HOG-Layout: Hierarchical 3D Scene Generation, Optimization and Editing via Vision-Language Models cs.CV · 2026-04-12 · unverdicted · none · ref 48 · internal anchor
HOG-Layout enables text-driven hierarchical 3D scene generation, optimization, and real-time editing using LLMs, VLMs, RAG for semantic consistency, and an optimization module for physical plausibility.
LLaVA-OneVision: Easy Visual Task Transfer cs.CV · 2024-08-06 · unverdicted · none · ref 149 · internal anchor
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 185 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 27 · internal anchor
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track cs.SD · 2026-04-20 · unverdicted · none · ref 24 · internal anchor
A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-VOS into sequential verification and refinement steps.
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method cs.CV · 2026-04-20 · unverdicted · none · ref 19 · internal anchor
An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer