
hub Mixed citations

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Mixed citation behavior. Most common role is background (60%).

96 Pith papers citing it
Background 60% of classified citations
abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
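The marking step described in the abstract (partition the image into regions, then overlay each region with a mark) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the region masks are assumed to come from a segmentation model such as SEEM or SAM and are given here as plain binary grids, placing each mark at the region pixel nearest the centroid is an assumption made for this sketch, and the names `mark_position`, `assign_marks`, and `build_prompt`, as well as the prompt wording, are hypothetical.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary grid: 1 means the pixel belongs to the region


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) region pixel closest to the region's centroid.

    Assumption for this sketch: the centroid is a reasonable place for a mark.
    Snapping to an actual region pixel keeps the mark inside the region.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("empty mask")
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    return min(pixels, key=lambda p: (p[0] - cr) ** 2 + (p[1] - cc) ** 2)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Number the regions 1..N and choose a pixel position for each mark."""
    return {label: mark_position(m) for label, m in enumerate(masks, start=1)}


def build_prompt(marks: Dict[int, Tuple[int, int]], question: str) -> str:
    """Compose a text prompt that refers to the overlaid marks (hypothetical wording)."""
    labels = ", ".join(str(k) for k in sorted(marks))
    return (
        f"The image is overlaid with numeric marks: {labels}. "
        f"Refer to regions by their mark number. Question: {question}"
    )


# Two toy regions on a 4x4 image: a 3x3 block and a short horizontal strip.
mask_a = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
mask_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 1]]

marks = assign_marks([mask_a, mask_b])
prompt = build_prompt(marks, "Which region is at the bottom of the image?")
```

In a full pipeline the mark positions would be used to draw the labels onto the image before it is sent to the multimodal model, so that an answer can name a region by its number.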

hub tools

citation-role summary

background 9 method 4 baseline 1 dataset 1

citation-polarity summary

claims ledger

  • baseline Multimodal-GPT [47] 7B - - - 0.5335 0.5440 - - InstructBLIP [36] 7B 10.3 86.2 45.26 0.8091 0.9392 35.5 41.3 GPT-4V [125] - 4.3 92.7 65.28 - - - - LLaVA (7B) [111] 7B 13.5 69.3 - - - 23.3 26.3 LLaVA (13B) [111] 13B - - - 0.8360 0.8729 - - MiniGPT-4 (7B) [225] 7B - - 35.78 0.5713 0.6359 61.4 50.1 MiniGPT-4 (13B) [225] 13B 15.9 76.7 - - - - - mPLUG-Owl2 [185] 7B 10.6 84.0 47.30 - - - - LLaVA-1.5 (7B) [110] 7B 8.6 82.9 - - - 44.6 46.4 LLaVA-1.5 (13B) [110] 13B - - 46.94 0.8566 0.9425 - - CogVLM [165
  • background [26] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026. [27] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024. [28] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and
  • background capabilities learned from single-image scenarios and relational reasoning skills developed from multi-image scenarios. The example highlights LLaVA-OneVision's proficiency in GUI understanding and task execution. S3: Set-of-mark Prompting (Transfer from single-image task composition). Different from existing open LLMs, LLaVA-OneVision demonstrates excellent set-of-marks (SoM) reasoning [149], an emerging capability shown in Table 8. To the best of our knowledge, this is the first time that op
  • background agent development methods like interactive learning and real-world exploration. Building realistic interactive environments is a major challenge in developing multimodal agents. Prior work that introduce executable environments simplify the observation and action spaces of human-computer interaction and limit task scope within specific applications or domains, such as web navigation in a few domains [44, 30, 58, 66], coding [57] and the combination [32, 54, 34]. Agents developed in these restric

co-cited works


representative citing papers

Trustworthy Image Authentication using Forensic Knowledge Graphs

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.

A History-Aware Visually Grounded Critic for Computer Use Agents

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

FP-Agent: Fingerprinting AI Browsing Agents

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 43 · internal anchor

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

  • Large Language Model-Brained GUI Agents: A Survey cs.AI · 2024-11-27 · unverdicted · none · ref 183 · internal anchor

    A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.