The WindowsWorld benchmark shows that leading GUI agents achieve under 21% success on multi-application professional tasks, with failures concentrated in conditional judgment across three or more apps and in inefficient execution.
hub Mixed citations
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Mixed citation behavior. Most common role is background (60%).
abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
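The marking step described in the abstract can be illustrated with a small sketch. It assumes the region masks have already been produced by some segmentation model (none is called here) and places one numeric mark at each region's centroid. The centroid rule, the helper names `mask_centroid`, `assign_marks`, and `render_marks`, and the ASCII rendering are all assumptions made for illustration; they are not the placement method or API of the released SoM code.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # binary mask: 1 marks a pixel that belongs to the region


def mask_centroid(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) centroid of a non-empty binary mask, rounded to ints."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask has no foreground pixels")
    mean_r = sum(r for r, _ in points) / len(points)
    mean_c = sum(c for _, c in points) / len(points)
    return round(mean_r), round(mean_c)


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Assign marks 1, 2, ... to regions, each placed at its region's centroid.

    Assumption of this sketch: a centroid can fall outside a non-convex
    region, which a real placement rule would need to handle.
    """
    return {label: mask_centroid(m) for label, m in enumerate(masks, start=1)}


def render_marks(height: int, width: int,
                 marks: Dict[int, Tuple[int, int]]) -> List[str]:
    """Draw the marks on a blank ASCII canvas (labels 1-9 only in this sketch)."""
    canvas = [["." for _ in range(width)] for _ in range(height)]
    for label, (r, c) in marks.items():
        canvas[r][c] = str(label)
    return ["".join(row) for row in canvas]


if __name__ == "__main__":
    # Two hypothetical 3x3 regions in a 6x6 image: top-left and bottom-right.
    region_a = [[1 if r < 3 and c < 3 else 0 for c in range(6)] for r in range(6)]
    region_b = [[1 if r >= 3 and c >= 3 else 0 for c in range(6)] for r in range(6)]
    marks = assign_marks([region_a, region_b])
    print("\n".join(render_marks(6, 6, marks)))
```

In a real pipeline the marks would be drawn onto the image itself and the marked image passed to the multimodal model, so that a question can refer to regions by their mark.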
claims ledger
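- baseline (column headers not included in the excerpt; each row is model [ref] | size | seven metric values)
  Multimodal-GPT [47] | 7B | - | - | - | 0.5335 | 0.5440 | - | -
  InstructBLIP [36] | 7B | 10.3 | 86.2 | 45.26 | 0.8091 | 0.9392 | 35.5 | 41.3
  GPT-4V [125] | - | 4.3 | 92.7 | 65.28 | - | - | - | -
  LLaVA (7B) [111] | 7B | 13.5 | 69.3 | - | - | - | 23.3 | 26.3
  LLaVA (13B) [111] | 13B | - | - | - | 0.8360 | 0.8729 | - | -
  MiniGPT-4 (7B) [225] | 7B | - | - | 35.78 | 0.5713 | 0.6359 | 61.4 | 50.1
  MiniGPT-4 (13B) [225] | 13B | 15.9 | 76.7 | - | - | - | - | -
  mPLUG-Owl2 [185] | 7B | 10.6 | 84.0 | 47.30 | - | - | - | -
  LLaVA-1.5 (7B) [110] | 7B | 8.6 | 82.9 | - | - | - | 44.6 | 46.4
  LLaVA-1.5 (13B) [110] | 13B | - | - | 46.94 | 0.8566 | 0.9425 | - | -
  CogVLM [165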
- background [26] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026. [27] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024. [28] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and
- background capabilities learned from single-image scenarios and relational reasoning skills developed from multi-image scenarios. The example highlights LLaVA-OneVision's proficiency in GUI understanding and task execution. S3: Set-of-mark Prompting (Transfer from single-image task composition). Different from existing open LLMs, LLaVA-OneVision demonstrates excellent set-of-marks (SoM) reasoning [149], an emerging capability shown in Table 8. To the best of our knowledge, this is the first time that op
- background agent development methods like interactive learning and real-world exploration. Building realistic interactive environments is a major challenge in developing multimodal agents. Prior work that introduces executable environments simplifies the observation and action spaces of human-computer interaction and limits task scope within specific applications or domains, such as web navigation in a few domains [44, 30, 58, 66], coding [57] and the combination [32, 54, 34]. Agents developed in these restric
representative citing papers
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
OpenRef is a benchmark for open-world referring expression comprehension (REC) with F1 and N3R metrics, accompanied by a training-free MCC to improve existing models in complex scenarios.
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.
Presents the CUActSpot benchmark and a renderer-LLM data synthesis approach that lets a 4B model outperform larger open-source models on complex computer interactions.
SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.
Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.
By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
LVLM-VA aligns small vision models with human knowledge using an LVLM bidirectional interface, reducing spurious features and group biases on synthetic and real datasets without fine-grained feedback.
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
citing papers explorer
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
- Large Language Model-Brained GUI Agents: A Survey
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
- ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring