iOSWorld is a new open-source benchmark for personally intelligent phone agents featuring connected personal data across 26 iOS apps and 133 tasks in three difficulty categories.
hub Mixed citations
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Mixed citation behavior. Most common role is background (60%).
abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide
- baseline Multimodal-GPT [47] 7B - - - 0.5335 0.5440 - - InstructBLIP [36] 7B 10.3 86.2 45.26 0.8091 0.9392 35.5 41.3 GPT-4V [125] - 4.3 92.7 65.28 - - - - LLaVA (7B) [111] 7B 13.5 69.3 - - - 23.3 26.3 LLaVA (13B) [111] 13B - - - 0.8360 0.8729 - - MiniGPT-4 (7B) [225] 7B - - 35.78 0.5713 0.6359 61.4 50.1 MiniGPT-4 (13B) [225] 13B 15.9 76.7 - - - - - mPLUG-Owl2 [185] 7B 10.6 84.0 47.30 - - - - LLaVA-1.5 (7B) [110] 7B 8.6 82.9 - - - 44.6 46.4 LLaVA-1.5 (13B) [110] 13B - - 46.94 0.8566 0.9425 - - CogVLM [165
- background [26] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026. [27] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024. 11 [28] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and
- background capabilities learned from single-image scenarios and relational reasoning skills developed from multi- image scenarios. The example highlights LLaV A-OneVision's proficiency in GUI understanding and task execution. S3: Set-of-mark Prompting (Transfer from single-image task composition). Different from existing open LLMs, LLaV A-OneVision demonstrates excellent set-of-marks (SoM) reasoning [149], an emerging capability shown in Table 8. To the best of our knowledge, this is the first time that op
- background agent development methods like interactive learning and real-world exploration. Building realistic interactive environments is a major challenge in developing multimodal agents. Prior work that introduce executable environments simplify the observation and action spaces of human-computer interaction and limit task scope within specific applications or domains, such as web navigation in a few domains [44, 30, 58, 66], coding [57] and the combination [32, 54, 34]. Agents developed in these restric
co-cited works
representative citing papers
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.
Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
Multimodal graph RAG with DLVQA benchmark outperforms MMRAG and KG methods on multi-hop document VQA tasks.
Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.
SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.
HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.
MAOAM unifies object and material selection via a VLM with segmentation head, supporting text and click interactions through multi-task training on VLM-generated material data.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
citing papers explorer
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
Large Language Model-Brained GUI Agents: A Survey
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.