hub Mixed citations

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao · 2023 · cs.CV · arXiv 2310.11441

Mixed citation behavior. Most common role is background (60%).

96 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 96 citing papers arXiv PDF

abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 method 4 baseline 1 dataset 1

citation-polarity summary

background 9 use method 4 baseline 1 use dataset 1

claims ledger

abstract We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide
baseline Multimodal-GPT [47] 7B - - - 0.5335 0.5440 - - InstructBLIP [36] 7B 10.3 86.2 45.26 0.8091 0.9392 35.5 41.3 GPT-4V [125] - 4.3 92.7 65.28 - - - - LLaVA (7B) [111] 7B 13.5 69.3 - - - 23.3 26.3 LLaVA (13B) [111] 13B - - - 0.8360 0.8729 - - MiniGPT-4 (7B) [225] 7B - - 35.78 0.5713 0.6359 61.4 50.1 MiniGPT-4 (13B) [225] 13B 15.9 76.7 - - - - - mPLUG-Owl2 [185] 7B 10.6 84.0 47.30 - - - - LLaVA-1.5 (7B) [110] 7B 8.6 82.9 - - - 44.6 46.4 LLaVA-1.5 (13B) [110] 13B - - 46.94 0.8566 0.9425 - - CogVLM [165
background [26] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026. [27] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024. 11 [28] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and
background capabilities learned from single-image scenarios and relational reasoning skills developed from multi- image scenarios. The example highlights LLaV A-OneVision's proficiency in GUI understanding and task execution. S3: Set-of-mark Prompting (Transfer from single-image task composition). Different from existing open LLMs, LLaV A-OneVision demonstrates excellent set-of-marks (SoM) reasoning [149], an emerging capability shown in Table 8. To the best of our knowledge, this is the first time that op
background agent development methods like interactive learning and real-world exploration. Building realistic interactive environments is a major challenge in developing multimodal agents. Prior work that introduce executable environments simplify the observation and action spaces of human-computer interaction and limit task scope within specific applications or domains, such as web navigation in a few domains [44, 30, 58, 66], coding [57] and the combination [32, 54, 34]. Agents developed in these restric

co-cited works

representative citing papers

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

iOSWorld is a new open-source benchmark for personally intelligent phone agents featuring connected personal data across 26 iOS apps and 133 tasks in three difficulty categories.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.

Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.

Multimodal Graph RAG for Long-range Visually Rich Document Understanding

cs.IR · 2026-06-27 · unverdicted · novelty 7.0

Multimodal graph RAG with DLVQA benchmark outperforms MMRAG and KG methods on multi-hop document VQA tasks.

Trustworthy Image Authentication using Forensic Knowledge Graphs

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.

A History-Aware Visually Grounded Critic for Computer Use Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.

MAOAM: Unified Object and Material Selection with Vision-Language Models

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

MAOAM unifies object and material selection via a VLM with segmentation head, supporting text and click interactions through multi-task training on VLM-generated material data.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies

cs.HC · 2026-05-11 · unverdicted · novelty 7.0

SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

FP-Agent: Fingerprinting AI Browsing Agents

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.

citing papers explorer

Showing 13 of 13 citing papers after filters.

ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs cs.RO · 2026-02-09 · unverdicted · none · ref 120 · internal anchor
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation cs.RO · 2024-09-03 · conditional · none · ref 140 · internal anchor
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
TAP-VLA: Tactile Annotation Prompting for Vision Language Action Models cs.RO · 2026-06-27 · unverdicted · none · ref 34 · internal anchor
TAP-VLA improves VLA performance in contact-rich manipulation by visually annotating tactile shear fields onto input images, reaching 78% success versus under 50% for vision-only and other tactile methods.
LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation cs.RO · 2026-06-26 · unverdicted · none · ref 53 · internal anchor
Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-05-13 · conditional · none · ref 34 · internal anchor
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction cs.RO · 2026-04-18 · unverdicted · none · ref 15 · internal anchor
COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning due to gaps between vision and execution.
Long-Term Memory for VLA-based Agents in Open-World Task Execution cs.RO · 2026-04-17 · unverdicted · none · ref 36 · internal anchor
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding cs.RO · 2025-12-27 · conditional · none · ref 17 · internal anchor
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies cs.RO · 2024-12-13 · conditional · none · ref 100 · internal anchor
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 79 · internal anchor
GTP-FA is a grasp-then-plan framework with failure attribution that diagnoses errors to optimize grasping priors and planning data collection, raising success rates across RL, IL, diffusion, and VLA methods in simulation and real robots.
Bridging Semantics and Kinematics: A Modular Framework for Zero-Shot Robotic Manipulation cs.RO · 2026-06-22 · unverdicted · none · ref 32 · internal anchor
A modular framework using FastSAM with Set-of-Mark prompting, an LLM as semantic router, and MoveIt Task Constructor achieves 62% end-to-end success in zero-shot language-guided robotic manipulation across open-world and spatial reasoning tasks.
CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models cs.RO · 2026-04-24 · unreviewed · ref 36 · internal anchor
OpenFrontier: General Navigation with Visual-Language Grounded Frontiers cs.RO · 2026-03-05 · unreviewed · ref 48 · internal anchor

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer