Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang · 2025 · cs.AI · arXiv 2504.00906

Mixed citation behavior. Most common role is background (60%).

31 Pith papers citing it

Background 60% of classified citations

abstract

Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluation. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld relatively. Code available at https://github.com/simular-ai/Agent-S.

citation-role summary

background 7 · method 2 · baseline 1

citation-polarity summary

background 6 · use method 2 · baseline 1 · unclear 1

representative citing papers

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

cs.AI · 2025-06-19 · unverdicted · novelty 7.0

AI agents on OSWorld take 2.7-4.3 times more steps than human trajectories, with latency rising sharply due to repeated large model calls for planning and reflection.

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

cs.HC · 2026-06-18 · unverdicted · novelty 6.0

MemGUI-Agent uses Context-as-Action (ConAct) for proactive context management in long-horizon GUI tasks, trained on the MemGUI-3K dataset to achieve top 8B-model results on MemGUI-Bench and MobileWorld.

Skill-Guided Continuation Distillation for GUI Agents

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

SGCD generates supervision for off-trajectory states in GUI agents by mixing expert trajectories with continuations produced by a skill-guided policy after the base policy reaches those states.

VISUALSKILL: Multimodal Skills for Computer-Use Agents

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

Multimodal skills retaining visual figures improve CUA benchmark scores by 8.3 points over text-only equivalents generated from the same source content.

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Demo2Tutorial distills human screen recordings into hierarchical image-text tutorials that outperform human-authored ones on a documentation-derived benchmark and improve downstream human task speed and GUI-agent planning.

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.

Multi-Agent Computer Use

cs.MA · 2026-06-01 · unverdicted · novelty 6.0

A manager-driven DAG decomposition with parallel subagents improves computer use agent success rates by 3.4-25.5% and reduces wall-clock time on long-horizon benchmarks.

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation for 33 desktop apps and 1000 tasks to support computer-use AI agents.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

cs.CV · 2026-05-18 · conditional · novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

cs.CL · 2026-04-23 · conditional · novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0

MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

cs.AI · 2025-12-11 · conditional · novelty 6.0

AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

cs.AI · 2025-10-28 · unverdicted · novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

cs.CL · 2025-09-09 · unverdicted · novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.

GTA1: GUI Test-time Scaling Agent

cs.AI · 2025-07-08 · unverdicted · novelty 6.0

GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.

DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking

cs.HC · 2025-05-06 · unverdicted · novelty 6.0

DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

cs.AI · 2025-04-19 · unverdicted · novelty 6.0

InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.

citing papers explorer

Showing 31 of 31 citing papers.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
cs.CL · 2026-04-27 · unverdicted · none · ref 56 · internal anchor
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
cs.AI · 2025-06-19 · unverdicted · none · ref 1 · internal anchor
AI agents on OSWorld take 2.7-4.3 times more steps than human trajectories, with latency rising sharply due to repeated large model calls for planning and reflection.

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents
cs.CV · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
cs.HC · 2026-06-18 · unverdicted · none · ref 1 · internal anchor
MemGUI-Agent uses Context-as-Action (ConAct) for proactive context management in long-horizon GUI tasks, trained on the MemGUI-3K dataset to achieve top 8B-model results on MemGUI-Bench and MobileWorld.

Skill-Guided Continuation Distillation for GUI Agents
cs.AI · 2026-06-17 · unverdicted · none · ref 54 · internal anchor
SGCD generates supervision for off-trajectory states in GUI agents by mixing expert trajectories with continuations produced by a skill-guided policy after the base policy reaches those states.

VISUALSKILL: Multimodal Skills for Computer-Use Agents
cs.CL · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
Multimodal skills retaining visual figures improve CUA benchmark scores by 8.3 points over text-only equivalents generated from the same source content.

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
cs.CV · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
Demo2Tutorial distills human screen recordings into hierarchical image-text tutorials that outperform human-authored ones on a documentation-derived benchmark and improve downstream human task speed and GUI-agent planning.

What to Format and How: A Benchmark and Workflow Approach for Document Formatting
cs.CL · 2026-06-01 · unverdicted · none · ref 33 · internal anchor
Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.

Multi-Agent Computer Use
cs.MA · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
A manager-driven DAG decomposition with parallel subagents improves computer use agent success rates by 3.4-25.5% and reduces wall-clock time on long-horizon benchmarks.

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
cs.LG · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.

OpenComputer: Verifiable Software Worlds for Computer-Use Agents
cs.AI · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation for 33 desktop apps and 1000 tasks to support computer-use AI agents.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
cs.CV · 2026-05-18 · conditional · none · ref 1 · internal anchor
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
cs.AI · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL · 2026-04-23 · conditional · none · ref 2 · internal anchor
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
cs.AR · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
cs.CV · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
cs.AI · 2025-12-11 · conditional · none · ref 1 · internal anchor
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
cs.AI · 2025-10-28 · unverdicted · none · ref 4 · internal anchor
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
cs.CL · 2025-09-09 · unverdicted · none · ref 1 · internal anchor
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.

GTA1: GUI Test-time Scaling Agent
cs.AI · 2025-07-08 · unverdicted · none · ref 8 · internal anchor
GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.

DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking
cs.HC · 2025-05-06 · unverdicted · none · ref 2 · internal anchor
DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
cs.AI · 2025-04-19 · unverdicted · none · ref 2 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI · 2026-04-30 · unverdicted · none · ref 4 · internal anchor
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

From Question Answering to Task Completion: A Survey on Agent System and Harness Design
cs.AI · 2026-06-14 · unverdicted · none · ref 108 · internal anchor
Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
cs.AI · 2025-05-16 · unverdicted · none · ref 3 · internal anchor
InfantAgent-Next integrates tool-based and vision agents in a modular architecture and reports 7.27% accuracy on OSWorld, exceeding Claude-Computer-Use while also testing on GAIA and SWE-Bench.

PrecisionCUA: Iterative Visual Refinement for Pixel-Precise Cursor Grounding in Code Editors
cs.CV · 2026-04-14 · unreviewed · ref 1 · internal anchor

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
cs.AI · 2026-04-06 · unreviewed · ref 2 · 2 links · internal anchor

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
cs.MA · 2026-02-12 · unreviewed · ref 24 · internal anchor

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
cs.AI · 2025-12-14 · unreviewed · ref 1 · internal anchor
