Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

Kapoor, R · 2024 · arXiv 2402.17553

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

citation-role summary

background 3 dataset 1

citation-polarity summary

background 3 use dataset 1

representative citing papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

cs.AI · 2024-05-23 · accept · novelty 7.0

AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

cs.CL · 2026-04-23 · conditional · novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

cs.CL · 2024-12-05 · conditional · novelty 6.0

Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

cs.CL · 2024-10-30 · unverdicted · novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

WebCanvas: Benchmarking Web Agents in Online Environments

cs.CL · 2024-06-18 · unverdicted · novelty 6.0

WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

cs.HC · 2024-01-10 · unverdicted · novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

citing papers explorer

Showing 10 of 10 citing papers.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 21
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis cs.AI · 2026-05-24 · unverdicted · none · ref 55
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents cs.AI · 2024-05-23 · accept · none · ref 13
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents cs.CV · 2026-06-30 · unverdicted · none · ref 12
Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation cs.CL · 2026-04-23 · conditional · none · ref 34
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction cs.CL · 2024-12-05 · conditional · none · ref 83
Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 136
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
WebCanvas: Benchmarking Web Agents in Online Environments cs.CL · 2024-06-18 · unverdicted · none · ref 10
WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 169
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security cs.HC · 2024-01-10 · unverdicted · none · ref 134
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer