A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
hub
Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under integrity evaluation.
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
citing papers explorer
-
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.