Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Mixed citation behavior. Most common role is background (60%).
abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is made public at: https://github.com/microsoft/SoM.
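The mark-overlay step described in the abstract can be illustrated with a minimal sketch. It assumes the segmentation model has already produced binary region masks, and it anchors each numeric mark at the region centroid, which is a simplification chosen for illustration and not necessarily the placement rule used by SoM. The names `mask_centroid`, `assign_marks`, and `build_prompt` are hypothetical and do not come from the SoM codebase.

```python
from typing import Dict, List, Tuple

# Assumption: each region is a binary mask given as nested lists (1 = region pixel).
Mask = List[List[int]]


def mask_centroid(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) centroid of a binary mask, floored to integers."""
    row_sum = col_sum = count = 0
    for r, line in enumerate(mask):
        for c, value in enumerate(line):
            if value:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        raise ValueError("mask contains no region pixels")
    return row_sum // count, col_sum // count


def assign_marks(masks: List[Mask]) -> Dict[int, Tuple[int, int]]:
    """Number the regions 1, 2, ... in input order and anchor each mark at its centroid.

    Note: a centroid can fall outside a non-convex region; a real overlay
    would need a placement rule that handles that case.
    """
    return {i: mask_centroid(m) for i, m in enumerate(masks, start=1)}


def build_prompt(question: str, marks: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text prompt (hypothetical wording) that refers to the overlaid marks."""
    ids = ", ".join(str(i) for i in sorted(marks))
    return (
        f"The image is annotated with numeric marks: {ids}. "
        f"{question} Answer using the mark numbers."
    )


if __name__ == "__main__":
    left_region = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
    right_region = [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]
    marks = assign_marks([left_region, right_region])
    print(marks)
    print(build_prompt("Which region is on the right?", marks))
```

In a full pipeline the marks would be drawn onto the image at these anchor points before the image and the prompt are sent to the multimodal model. Drawing and the model call are omitted here.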
claims ledger
- baseline
  Multimodal-GPT [47] | 7B | - | - | - | 0.5335 | 0.5440 | - | -
  InstructBLIP [36] | 7B | 10.3 | 86.2 | 45.26 | 0.8091 | 0.9392 | 35.5 | 41.3
  GPT-4V [125] | - | 4.3 | 92.7 | 65.28 | - | - | - | -
  LLaVA (7B) [111] | 7B | 13.5 | 69.3 | - | - | - | 23.3 | 26.3
  LLaVA (13B) [111] | 13B | - | - | - | 0.8360 | 0.8729 | - | -
  MiniGPT-4 (7B) [225] | 7B | - | - | 35.78 | 0.5713 | 0.6359 | 61.4 | 50.1
  MiniGPT-4 (13B) [225] | 13B | 15.9 | 76.7 | - | - | - | - | -
  mPLUG-Owl2 [185] | 7B | 10.6 | 84.0 | 47.30 | - | - | - | -
  LLaVA-1.5 (7B) [110] | 7B | 8.6 | 82.9 | - | - | - | 44.6 | 46.4
  LLaVA-1.5 (13B) [110] | 13B | - | - | 46.94 | 0.8566 | 0.9425 | - | -
  CogVLM [165
- background [26] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026. [27] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024. [28] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and
- background capabilities learned from single-image scenarios and relational reasoning skills developed from multi-image scenarios. The example highlights LLaVA-OneVision's proficiency in GUI understanding and task execution. S3: Set-of-mark Prompting (Transfer from single-image task composition). Different from existing open LLMs, LLaVA-OneVision demonstrates excellent set-of-marks (SoM) reasoning [149], an emerging capability shown in Table 8. To the best of our knowledge, this is the first time that op
- background agent development methods like interactive learning and real-world exploration. Building realistic interactive environments is a major challenge in developing multimodal agents. Prior works that introduce executable environments simplify the observation and action spaces of human-computer interaction and limit task scope within specific applications or domains, such as web navigation in a few domains [44, 30, 58, 66], coding [57] and the combination [32, 54, 34]. Agents developed in these restric
citing papers explorer
-
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
DeepLatent: Think with Images via Parallel Latent Visual Reasoning
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments
WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
Sketch-based Access Control: A Multimodal Interface for Translating User Preferences into Intent-Aligned Policies
SBAC uses sketching and multimodal LLMs to help users refine underspecified access control preferences into complete, validated policies through iterative human-AI collaboration.
-
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
FP-Agent: Fingerprinting AI Browsing Agents
Behavioral fingerprints distinguish AI browsing agents from humans and each other, enabling superior detection compared to current bot systems.
-
ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models
ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.
-
Penny Wise, Pixel Foolish: Bypassing Price Constraints in Multimodal Agents via Visual Adversarial Perturbations
Visual adversarial perturbations bypass price constraints in multimodal agents by exploiting visual dominance over text, with PriceBlind achieving ~80% white-box ASR and 35-41% transfer ASR.
-
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
-
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
LVLM-Aided Alignment of Task-Specific Vision Models
LVLM-VA aligns small vision models with human knowledge using an LVLM bidirectional interface, reducing spurious features and group biases on synthetic and real datasets without fine-grained feedback.
-
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
-
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
-
From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation
An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.
-
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
WebMall is the first offline multi-shop benchmark for evaluating LLM web agents on complex comparison shopping tasks across heterogeneous product data from multiple simulated e-shops.
-
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
EconWebArena is a new benchmark with 360 curated economic tasks across 82 authoritative websites for evaluating multimodal web agents on navigation, grounding, and data extraction.
-
GRIT: Teaching MLLMs to Think with Images
GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
-
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
-
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
-
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
-
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
-
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
-
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
-
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
Proactive multi-window state triggering plus Set-of-Mark alignment and multimodal LLM reasoning detects GUI defects in Android apps, reporting 184% more text truncation, 87.2% F1 on occlusion, and 40 defect-prone apps at 10% FPR.
-
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning due to gaps between vision and execution.
-
Long-Term Memory for VLA-based Agents in Open-World Task Execution
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
-
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
-
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.
-
Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding
OBEYED-VLA improves VLA robustness in cluttered real-world manipulation by disentangling perception into VLM-based object-centric grounding and geometry-aware stages, then fine-tuning the policy only on single-object demonstrations.
-
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
-
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
-
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.
-
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on benchmarks and online tests.
-
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
-
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
MLLMs achieve competitive but subhuman performance on the new VSI-Bench for visual-spatial intelligence from videos, with spatial reasoning as the main bottleneck and explicit cognitive map generation improving distance estimation.
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
BLINK: Multimodal Large Language Models Can See but Not Perceive
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.