MacAgentBench is a new benchmark for macOS AI agents with 676 tasks, deterministic multi-checkpoint evaluation, and tests across frameworks showing skill libraries drive performance more than framework design.
Os-symphony: A holistic framework for robust and generalist computer-using agent
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5roles
background 2polarities
background 2representative citing papers
HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Controlled empirical study shows correcting Wikipedia data coverage yields larger gains than algorithm differences in LLM search agent training, with outcome-based rewards competitive.
citing papers explorer
-
MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop
MacAgentBench is a new benchmark for macOS AI agents with 676 tasks, deterministic multi-checkpoint evaluation, and tests across frameworks showing skill libraries drive performance more than framework design.
-
A History-Aware Visually Grounded Critic for Computer Use Agents
HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.