
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair · 2024 · cs.AI · arXiv 2405.14573

Mixed citation behavior. The most common role is background (67% of classified citations).

80 Pith papers citing it


abstract

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the benchmark. Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. AndroidWorld and the experiments in this paper are available at github.com/google-research/android_world.
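The task lifecycle the abstract describes (a parameterized natural-language goal, plus dedicated initialization, success-checking, and tear-down logic) can be illustrated with a small sketch. This is not AndroidWorld's actual API: `FakeDevice`, `AddContactTask`, `run_episode`, and the two scripted agents are hypothetical names invented here, and an in-memory dictionary stands in for real device state.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Hypothetical in-memory stand-in for device state (not a real emulator)."""
    contacts: dict = field(default_factory=dict)


class AddContactTask:
    """Hypothetical parameterized task with setup, success check, and teardown."""

    template = "Add a contact named {name} with phone number {phone}."

    def __init__(self, seed: int):
        # A seeded generator makes each sampled task variation reproducible.
        rng = random.Random(seed)
        self.params = {
            "name": rng.choice(["Ana", "Bo", "Chen"]),
            "phone": "555-" + "".join(rng.choice("0123456789") for _ in range(4)),
        }

    @property
    def goal(self) -> str:
        # The natural-language instruction is built from the sampled parameters.
        return self.template.format(**self.params)

    def initialize(self, device: FakeDevice) -> None:
        # Start from a known state: the target contact must not already exist.
        device.contacts.pop(self.params["name"], None)

    def is_successful(self, device: FakeDevice) -> float:
        # Reward comes from inspecting state, not from the agent's own report.
        stored = device.contacts.get(self.params["name"])
        return 1.0 if stored == self.params["phone"] else 0.0

    def tear_down(self, device: FakeDevice) -> None:
        # Remove whatever the episode created so later tasks are unaffected.
        device.contacts.pop(self.params["name"], None)


def run_episode(task, device, agent) -> float:
    """Run initialize -> agent -> success check, always followed by teardown."""
    task.initialize(device)
    try:
        agent(task, device)
        return task.is_successful(device)
    finally:
        task.tear_down(device)


def oracle_agent(task, device):
    # Scripted agent that performs the task directly, for illustration only.
    device.contacts[task.params["name"]] = task.params["phone"]


def noop_agent(task, device):
    # Scripted agent that does nothing.
    pass
```

Calling `run_episode(AddContactTask(seed=0), FakeDevice(), oracle_agent)` yields a reward of 1.0, while `noop_agent` yields 0.0. Checking state rather than agent output is what lets the same success logic cover every parameter variation of a task.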


citation-role summary

background: 12 · dataset: 5 · baseline: 1
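As a quick check, the role counts above reproduce the 67% background share reported for this paper. The sketch below assumes the 18 citations counted here are the full set of classified citations; the page does not state this explicitly, and `role_shares` is a hypothetical helper.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 12, "dataset": 5, "baseline": 1}


def role_shares(counts: dict) -> dict:
    """Return each role's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
```

Here 12 of 18 classified citations is 66.7%, which rounds to the 67% shown for the background role.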

citation-polarity summary

background: 12 · use dataset: 5 · baseline: 1

representative citing papers

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

cs.MM · 2026-06-26 · unverdicted · novelty 8.0

Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

cs.HC · 2026-06-18 · unverdicted · novelty 7.0

MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld using only automatically generated annotation-free data via MobileGym and HiFPO, with ForgeOwl-8B reaching 77.6%.

ProCUA-SFT Technical Report

cs.LG · 2026-06-15 · conditional · novelty 7.0

ProCUA-SFT is a 3.1M-sample SFT dataset from 93K verified synthetic trajectories that lifts UI-TARS 7B OSWorld score from 26.3% to 45%.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.

Rethinking Side-Channel Analysis: Automated Discovery and Analysis of Side-Channel Leakage with LLM-Assisted Agents

cs.CR · 2026-05-17 · unverdicted · novelty 7.0

SCAgent automates side-channel leakage discovery via LLM agents for target identification and few-shot foundation models for scalable analysis on iOS.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

MMSkills: Towards Multimodal Skills for General Visual Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 3 refs

MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Benchmarking and Improving GUI Agents in High-Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

VeriGUI adds a Thinking-Verification-Action-Expectation loop and two-stage training on synthetic failures to reduce undetected action errors and improve recovery in GUI automation.

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

cs.HC · 2026-04-03 · unverdicted · novelty 7.0

OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

cs.AI · 2025-09-08 · conditional · novelty 7.0

MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.

Beyond Syntax: Action Semantics Learning for App Agents

cs.AI · 2025-06-21 · unverdicted · novelty 7.0

Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.

What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States

cs.CV · 2026-06-30 · unverdicted · novelty 6.0 · 2 refs

Introduces Active Task Driving Memory (ATMem) and STR-GRPO to move GUI agents from passive record storage to actively maintained task states, tested on a new mobile benchmark with progress and scope-aware metrics.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

PhoneBuddy: Training Open Models for Agentic Phone Use

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.

citing papers explorer

Showing 50 of 80 citing papers.

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents cs.MM · 2026-06-26 · unverdicted · none · ref 26 · internal anchor
Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization cs.HC · 2026-06-18 · unverdicted · none · ref 10 · internal anchor
MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld using only automatically generated annotation-free data via MobileGym and HiFPO, with ForgeOwl-8B reaching 77.6%.
ProCUA-SFT Technical Report cs.LG · 2026-06-15 · conditional · none · ref 10 · internal anchor
ProCUA-SFT is a 3.1M-sample SFT dataset from 93K verified synthetic trajectories that lifts UI-TARS 7B OSWorld score from 26.3% to 45%.
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields cs.AI · 2026-06-09 · unverdicted · none · ref 12 · internal anchor
Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks cs.AI · 2026-06-08 · unverdicted · none · ref 53 · internal anchor
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions cs.AI · 2026-06-04 · unverdicted · none · ref 9 · internal anchor
DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications cs.CV · 2026-05-26 · unverdicted · none · ref 38 · internal anchor
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis cs.AI · 2026-05-24 · unverdicted · none · ref 16 · internal anchor
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps.
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games cs.AI · 2026-05-17 · unverdicted · none · ref 25 · 2 links · internal anchor
WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.
Rethinking Side-Channel Analysis: Automated Discovery and Analysis of Side-Channel Leakage with LLM-Assisted Agents cs.CR · 2026-05-17 · unverdicted · none · ref 14 · internal anchor
SCAgent automates side-channel leakage discovery via LLM agents for target identification and few-shot foundation models for scalable analysis on iOS.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment cs.LG · 2026-05-14 · unverdicted · none · ref 105 · 2 links · internal anchor
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unverdicted · none · ref 26 · 3 links · internal anchor
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 27 · 2 links · internal anchor
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 40 · internal anchor
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction cs.CL · 2026-04-07 · unverdicted · none · ref 4 · internal anchor
VeriGUI adds a Thinking-Verification-Action-Expectation loop and two-stage training on synthetic failures to reduce undetected action errors and improve recovery in GUI automation.
OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments cs.HC · 2026-04-03 · unverdicted · none · ref 17 · internal anchor
OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents cs.AI · 2025-09-08 · conditional · none · ref 26 · internal anchor
MAS-Bench introduces 139 tasks, 88 predefined shortcuts, and 9 metrics to evaluate hybrid GUI-shortcut mobile agents, reporting up to 68.3% success and 39% efficiency gains over GUI-only baselines.
Beyond Syntax: Action Semantics Learning for App Agents cs.AI · 2025-06-21 · unverdicted · none · ref 6 · internal anchor
Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.
What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States cs.CV · 2026-06-30 · unverdicted · none · ref 95 · 2 links · internal anchor
Introduces Active Task Driving Memory (ATMem) and STR-GRPO to move GUI agents from passive record storage to actively maintained task states, tested on a new mobile benchmark with progress and scope-aware metrics.
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks cs.AI · 2026-06-28 · unverdicted · none · ref 71 · internal anchor
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
PhoneBuddy: Training Open Models for Agentic Phone Use cs.CL · 2026-06-22 · unverdicted · none · ref 8 · internal anchor
PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation cs.CV · 2026-06-15 · unverdicted · none · ref 239 · internal anchor
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy cs.MA · 2026-06-06 · unverdicted · none · ref 38 · internal anchor
Emergence World is a model-agnostic multi-agent simulation platform integrating live data, 120+ tools, persistent memory, and democratic governance, illustrated by a 15-day study showing divergent outcomes across five LLM models.
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark cs.CV · 2026-06-04 · unverdicted · none · ref 174 · internal anchor
WorldBench is a visually diverse multimodal reasoning benchmark where the strongest of 15 tested MLLMs reaches only 64% accuracy.
UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents cs.AI · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
UI-KOBE constructs reusable app knowledge graphs from autonomous exploration to provide runtime guidance that improves lightweight mobile GUI agents.
GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing cs.SE · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
GUITestScape supplies an interactive benchmark for exploratory GUI testing and GUIJudge supplies an open-set process-aware evaluator that outperforms baselines on MLLM agents.
PhoneWorld: Scaling Phone-Use Agent Environments cs.CL · 2026-05-28 · unverdicted · none · ref 10 · internal anchor
PhoneWorld is a pipeline that converts real mobile trajectories into scalable controllable environments, yielding large gains on four benchmarks when used to supplement training data.
unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning cs.CR · 2026-05-27 · unverdicted · none · ref 9 · internal anchor
unix-ctf procedurally generates 656 Unix CTF tasks across 155 techniques; fine-tuning Qwen3-8B on them raises solve rate from 11.6% to 43.6% on a 15-skill holdout and yields +33 pp in Forensics on InterCode-CTF.
MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration cs.AI · 2026-05-26 · unverdicted · none · ref 13 · internal anchor
MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.
OpenComputer: Verifiable Software Worlds for Computer-Use Agents cs.AI · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation for 33 desktop apps and 1000 tasks to support computer-use AI agents.
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees cs.AI · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation cs.AI · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents cs.CL · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.
Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents cs.HC · 2026-05-04 · unverdicted · none · ref 10 · internal anchor
Augmented Nielsen heuristics improve computer-use agent task completion on varied interfaces while preserving human usability, as shown in UI-Verse experiments and human studies.
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows cs.SE · 2026-04-30 · unverdicted · none · ref 34 · internal anchor
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning cs.LG · 2026-04-24 · unverdicted · none · ref 13 · internal anchor
SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation cs.CL · 2026-04-23 · conditional · none · ref 51 · internal anchor
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents cs.HC · 2026-04-22 · unverdicted · none · ref 40 · internal anchor
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots cs.HC · 2026-04-20 · unverdicted · none · ref 47 · internal anchor
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization cs.AI · 2026-04-13 · unverdicted · none · ref 27 · internal anchor
TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
Gym-Anything: Turn any Software into an Agent Environment cs.LG · 2026-04-07 · unverdicted · none · ref 37 · internal anchor
Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improves small VLMs.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents cs.CR · 2026-02-24 · unverdicted · none · ref 76 · internal anchor
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies cs.CL · 2026-02-10 · conditional · none · ref 30 · internal anchor
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration cs.AI · 2025-12-22 · unverdicted · none · ref 16 · internal anchor
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management cs.AI · 2025-12-11 · conditional · none · ref 31 · internal anchor
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction cs.AI · 2025-10-28 · unverdicted · none · ref 24 · internal anchor
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
GTA1: GUI Test-time Scaling Agent cs.AI · 2025-07-08 · unverdicted · none · ref 38 · internal anchor
GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
Mobile GUI Agents under Real-world Threats: Are We There Yet? cs.CR · 2025-07-06 · conditional · none · ref 19 · internal anchor
Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
