OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
hub Canonical reference
ScreenAI: A Vision-Language Model for UI and Infographics Understanding , year =
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.
PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
MUIAnno is an expert-annotated dataset of mobile UI screens from iOS apps with structured JSON labels and baseline results for UI element detection.
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
Proposes four architectural patterns—Hybrid Affordance Integration, Adaptive Visual Anchoring, Visual Hierarchy Synthesis, and Semantic Scene Graph—to balance non-determinism and latency of foundation models with enterprise requirements for determinism and real-time performance.
citing papers explorer
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
-
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
-
PageGuide: Browser extension to assist users in navigating a webpage and locating information
PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
Anonymization-Enhanced Privacy Protection for Mobile GUI Agents: Available but Invisible
An anonymization framework replaces sensitive UI content with deterministic placeholders to protect privacy in mobile GUI agents while preserving task performance.
-
MUIAnno: An Expert-Annotated Dataset and Evaluation Benchmark for Mobile UI Understanding
MUIAnno is an expert-annotated dataset of mobile UI screens from iOS apps with structured JSON labels and baseline results for UI element detection.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
-
A Pattern Language for Resilient Visual Agents
Proposes four architectural patterns—Hybrid Affordance Integration, Adaptive Visual Anchoring, Visual Hierarchy Synthesis, and Semantic Scene Graph—to balance non-determinism and latency of foundation models with enterprise requirements for determinism and real-time performance.