arxiv: 2405.14573 · v5 · submitted 2024-05-23 · 💻 cs.AI · cs.LG

Recognition: 3 theorem links

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Alice Li, Christopher Rawles, Daniel Toyama, Divya Tyamagundlu, Folawiyo Campbell-Ajala, Gabrielle Lau, Jonathan Waltz, Marybeth Fair, Oriana Riva, Robert Berry, Sarah Clinckemaillie, Timothy Lillicrap, Wei Li, William Bishop, Yifan Chang

Pith reviewed 2026-05-13 12:02 UTC · model grok-4.3

keywords autonomous agents · Android benchmark · mobile agents · dynamic tasks · agent evaluation · reinforcement learning · task success

The pith

AndroidWorld supplies a dynamic Android environment with 116 tasks across 20 apps, where the best baseline agent completes only 30.6 percent of tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AndroidWorld, a benchmark that runs real Android apps and constructs tasks dynamically: each task is parameterized and its instruction is expressed in natural language, so one task definition yields many concrete variants. Each task comes with its own setup, a success check based on device state, and cleanup steps, so results stay reproducible. In experiments with baseline agents, the strongest one completes 30.6 percent of the tasks, and a desktop web agent adapted to Android is less effective on mobile. A robustness analysis shows that variations of the same task can significantly change agent performance. This setup lets researchers test whether agents can handle realistic mobile use rather than a static test set.

Core claim

AndroidWorld is a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world apps. Tasks are dynamically constructed, parameterized, and expressed in natural language in unlimited ways, with dedicated initialization, success-checking, and tear-down logic that modifies and inspects the device's system state. Baseline experiments show the best agent completes 30.6 percent of tasks, an adapted desktop web agent proves less effective on mobile, and the robustness analysis shows that task variations can significantly affect performance.
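
The task lifecycle in this claim (parameterize, initialize, let the agent act, check state, tear down) can be sketched in a few lines. This is a minimal illustration and not the AndroidWorld API: `FakeDevice`, `AddContactTask`, `run_episode`, and the method names `initialize`, `is_successful`, and `tear_down` are hypothetical stand-ins, and an in-memory dictionary takes the place of real device state.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeDevice:
    """Hypothetical stand-in for device system state (not a real emulator)."""
    contacts: dict = field(default_factory=dict)


class AddContactTask:
    """Hypothetical parameterized task: one template, many concrete instances."""

    TEMPLATE = "Add a contact named {name} with phone number {phone}."
    NAMES = ["Ana Silva", "Bo Chen", "Chidi Okafor"]

    def __init__(self, seed: int):
        rng = random.Random(seed)  # seeded so each instance is reproducible
        self.params = {
            "name": rng.choice(self.NAMES),
            "phone": "555-" + "".join(str(rng.randint(0, 9)) for _ in range(4)),
        }

    @property
    def goal(self) -> str:
        # Natural-language instruction built from the sampled parameters.
        return self.TEMPLATE.format(**self.params)

    def initialize(self, device: FakeDevice) -> None:
        # Put the device in a known state before the agent acts.
        device.contacts.clear()

    def is_successful(self, device: FakeDevice) -> float:
        # Reward comes from inspecting state, not from the agent's own report.
        stored = device.contacts.get(self.params["name"])
        return 1.0 if stored == self.params["phone"] else 0.0

    def tear_down(self, device: FakeDevice) -> None:
        # Remove whatever the episode created so the next task starts clean.
        device.contacts.clear()


def run_episode(task, device, agent) -> float:
    """Run one episode; `agent` is any callable taking (goal, device)."""
    task.initialize(device)
    try:
        agent(task.goal, device)
        return task.is_successful(device)
    finally:
        task.tear_down(device)
```

Because the reward is read from state after the agent finishes, the same check works for any phrasing of the goal and any sampled parameters, which is what makes dynamically generated variants comparable across runs.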

What carries the argument

AndroidWorld, a dynamic Android environment that parameterizes tasks in natural language and supplies built-in success verification through system-state inspection.

If this is right

Current agents leave substantial room for improvement on realistic mobile tasks.
Desktop web agents require specific changes to perform well on Android.
Metrics that ignore task variations can overstate an agent's practical reliability.
Unlimited dynamic task generation supports training and evaluation at larger scale than static test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved performance here could support agents that handle phone-based tasks for users who cannot interact with apps directly.
The benchmark's emphasis on state changes may encourage agents that maintain context across app switches and interruptions.
Extending the same dynamic construction approach to multi-app sequences could reveal new failure modes not visible in single-app tasks.

Load-bearing premise

Success on the 116 tasks and their variations will translate to useful behavior on actual user goals outside the benchmark.

What would settle it

Measure whether agents that score high on AndroidWorld also succeed when given real user instructions on unmodified phones without the benchmark's setup and verification logic.

Original abstract

Autonomous agents that execute human tasks by controlling computers can enhance human productivity and application accessibility. However, progress in this field will be driven by realistic and reproducible benchmarks. We present AndroidWorld, a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide a static test set, AndroidWorld dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, thus enabling testing on a much larger and more realistic suite of tasks. To ensure reproducibility, each task includes dedicated initialization, success-checking, and tear-down logic, which modifies and inspects the device's system state. We experiment with baseline agents to test AndroidWorld and provide initial results on the benchmark. Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-platform agents. Finally, we also conduct a robustness analysis, showing that task variations can significantly affect agent performance, demonstrating that without such testing, agent performance metrics may not fully reflect practical challenges. AndroidWorld and the experiments in this paper are available at github.com/google-research/android_world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

AndroidWorld delivers a reproducible dynamic benchmark for real Android apps that the field actually needs, though its 116 tasks lack any validation against real usage patterns.

The core contribution is a live Android testbed with 116 programmatic tasks across 20 apps. Each task is parameterized, so it can be instantiated in unlimited natural-language variations on the fly, and each run has built-in initialization, success checks, and teardown. That setup is new and directly tackles the reproducibility problems that have held back mobile agent work compared to desktop or web benchmarks. The baselines are honest: the strongest agent reaches 30.6 percent success, a web-agent adaptation is less effective on mobile, and the robustness checks show that task variations can move scores noticeably. All of this ships with code, which makes the environment usable right away.

The main limitation is that the paper gives no usage data, app-store telemetry, or user study to show these tasks reflect typical Android goals rather than a convenient selection. Without that, the 30.6 percent number is harder to treat as a reliable signal of progress toward practical agents. The results also omit error bars and statistical tests, a gap that is easy to fix. This paper is for people building or evaluating autonomous agents on mobile devices. It is worth sending to peer review because the environment itself is a concrete, open resource that others can extend, even if the initial baselines stay preliminary.

Referee Report

2 major / 2 minor

Summary. The paper introduces AndroidWorld, a dynamic Android benchmarking environment providing reward signals for 116 programmatic tasks across 20 real apps. Tasks are parameterized and expressed in natural language with dedicated initialization, success-checking, and tear-down logic. Baseline agents achieve up to 30.6% success; a web agent is adapted to Android (performing worse); robustness analysis shows task variations affect performance. The environment and code are open-sourced.

Significance. If the task set proves representative, AndroidWorld offers a reproducible, state-grounded benchmark that advances mobile agent evaluation beyond static suites. The dynamic generation and real-app grounding are clear strengths, and open-sourcing supports community progress. The 30.6% baseline and cross-platform gap highlight concrete research directions.

major comments (2)

[Experiments] Experiments section: the headline 30.6% success rate is presented without error bars, confidence intervals, or statistical tests despite dynamic task variations and multiple runs; this weakens claims about baseline performance and robustness.
[Benchmark construction] Benchmark construction (Sections 3–4): no usage-frequency data, app-store telemetry, or user-study validation is reported for the choice of 20 apps and 116 tasks, leaving the central claim of realism and real-world transfer ungrounded and potentially subject to selection bias.
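
To make the first major comment concrete, the sketch below computes a Wilson score interval for a success rate measured on 116 tasks. It assumes each task is an independent pass/fail trial, which is a simplification, and the count of 36 successes is illustrative only: it is close to the reported 30.6 percent but is not a figure from the paper. The function name `wilson_interval` is our own.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Illustrative count only: 36 of 116 is near, but not equal to, the reported rate.
low, high = wilson_interval(36, 116)
print(f"{36 / 116:.3f} success, 95% interval [{low:.3f}, {high:.3f}]")
```

With only 116 trials the interval is more than ten percentage points wide, so small differences between agents on a single pass through the task set would be hard to distinguish from noise.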

minor comments (2)

[Experiments] Agent descriptions in the experiments are high-level; additional implementation details (prompts, action spaces, observation formats) would improve reproducibility.
[Figures/Tables] Figure captions and table legends could more explicitly link visual results to the robustness analysis on task variations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the changes we will make to strengthen the manuscript.

Referee: [Experiments] Experiments section: the headline 30.6% success rate is presented without error bars, confidence intervals, or statistical tests despite dynamic task variations and multiple runs; this weakens claims about baseline performance and robustness.

Authors: We agree that the presentation of the 30.6% success rate would be strengthened by including error bars, confidence intervals, and statistical tests. The experiments were run across multiple random seeds to account for task variation, but these details were not fully reported. In the revised manuscript we will add error bars computed over independent runs, report confidence intervals, and include statistical tests comparing agent variants and task conditions.

revision: yes

Referee: [Benchmark construction] Benchmark construction (Sections 3–4): no usage-frequency data, app-store telemetry, or user-study validation is reported for the choice of 20 apps and 116 tasks, leaving the central claim of realism and real-world transfer ungrounded and potentially subject to selection bias.

Authors: The 20 apps and 116 tasks were selected to cover a broad range of common Android interactions and popular application categories, as described in Sections 3 and 4. We did not collect new usage-frequency telemetry or conduct a dedicated user study for this paper. In the revision we will expand the justification of the selection criteria, cite publicly available app-usage statistics where possible, and explicitly acknowledge the absence of formal validation data as a limitation while noting that the dynamic parameterization and open-sourced environment allow the community to extend the benchmark.

revision: partial
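
The error bars promised in the first response could be computed along these lines. This is a sketch under stated assumptions: the per-seed success rates are made up for illustration, the helper name `summarize_runs` is our own, and the paper's actual number of seeds and reporting procedure are not given here.

```python
import math
import statistics


def summarize_runs(rates):
    """Mean, sample standard deviation, and standard error over independent runs."""
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(rates)
    std = statistics.stdev(rates)  # sample (n-1) standard deviation
    sem = std / math.sqrt(len(rates))
    return mean, std, sem


# Made-up per-seed success rates, for illustration only.
per_seed = [0.28, 0.31, 0.33, 0.30]
mean, std, sem = summarize_runs(per_seed)
print(f"mean={mean:.3f} std={std:.3f} sem={sem:.3f}")
```

Reporting the standard error alongside the mean shows how much of a difference between agents could be explained by seed-to-seed variation in the sampled task parameters.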

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements, not derivations

full rationale

The paper defines AndroidWorld as a new environment with 116 programmatically specified tasks across 20 real Android apps, each with explicit initialization, success-checking, and tear-down logic. It then runs baseline agents on this fixed task set and reports the empirical success rate of 30.6%. No equations, predictions, or first-principles claims appear; the headline number is simply the observed fraction of tasks completed. Task parameterization and natural-language variations are part of the benchmark definition itself, not a fitted or renamed quantity. Any self-citations are incidental and not load-bearing for the central empirical result. Concerns about task representativeness of real-user distributions are external-validity issues, not reductions of the reported result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is an engineered environment rather than a mathematical derivation; it rests on standard Android OS behavior and task definitions supplied by the authors.

axioms (1)

domain assumption: Android apps respond deterministically to UI actions when initialized to the same state.
Invoked in the task initialization and success-checking logic sections.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear
Our best agent can complete 30.6% of AndroidWorld's tasks, leaving ample room for future work.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
cs.CL 2026-05 unverdicted novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...

Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI 2026-04 unverdicted novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
cs.CL 2026-04 unverdicted novelty 7.0

VeriGUI adds a Thinking-Verification-Action-Expectation loop and two-stage training on synthetic failures to reduce undetected action errors and improve recovery in GUI automation.

MMSkills: Towards Multimodal Skills for General Visual Agents
cs.AI 2026-05 unverdicted novelty 6.0

MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
cs.AI 2026-05 unverdicted novelty 6.0

Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 6.0

Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
cs.CL 2026-05 unverdicted novelty 6.0

Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.

Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents
cs.HC 2026-05 unverdicted novelty 6.0

Augmented Nielsen heuristics improve computer-use agent task completion on varied interfaces while preserving human usability, as shown in UI-Verse experiments and human studies.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
cs.SE 2026-04 unverdicted novelty 6.0

Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
cs.HC 2026-04 unverdicted novelty 6.0

AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
cs.HC 2026-04 unverdicted novelty 6.0

A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.

Gym-Anything: Turn any Software into an Agent Environment
cs.LG 2026-04 unverdicted novelty 6.0

Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improv...

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
cs.CV 2026-04 unverdicted novelty 5.0

GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
cs.CV 2026-04 unverdicted novelty 4.0

GLM-5V-Turbo integrates multimodal perception directly into reasoning, planning, tool use, and execution for agents, yielding strong results in multimodal coding and framework-based tasks while keeping text coding com...

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
cs.CV 2026-04 unverdicted novelty 4.0

GLM-5V-Turbo integrates multimodal perception directly into reasoning and agent workflows, reporting strong results on visual tool use, multimodal coding, and framework-based agent tasks while keeping text coding competitive.

Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
cs.CV 2026-05 unverdicted novelty 3.0

X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 23 Pith papers · 5 internal anchors

[1]

Evaluating multimodal interactive agents,

Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jes- sica Landon, Timothy Lillicrap, Alistair Muldal, Blake Richards, Adam Santoro, Tamara von 10 Published as a conference paper at ICLR 2025 Glehn, Greg Wayne, Nathaniel Wong, and Chen Yan. Evaluating multimodal interactive agents,

work page 2025
[2]

Bonatti, D

URL https://arxiv. org/abs/2409.08264. Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plum- mer. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interac- tive visual environments. CoRR, abs/2104.08560,

work page arXiv
[3]

URL https://arxiv.org/abs/ 2104.08560. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea ...

work page arXiv
[4]

Seeclick: Harnessing gui grounding for advanced visual gui agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiy- ong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935,

work page arXiv
[5]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132,

work page internal anchor Pith review arXiv
[6]

How many random seeds? statistical power analysis in deep reinforcement learning experiments

C´edric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How many random seeds? statistical power analysis in deep reinforcement learning experiments. arXiv preprint arXiv:1806.08295,

work page arXiv
[7]

Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718,

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, L ´eo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718,

work page arXiv
[8]

Tooltalk: Evaluating tool-usage in a conversational setting

Nicholas Farn and Richard Shin. Tooltalk: Evaluating tool-usage in a conversational setting. arXiv preprint arXiv:2311.10775,

work page arXiv
[9]

Assistgui: Task-oriented desktop graphical user interface automation

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108,

work page arXiv
[10]

11 Published as a conference paper at ICLR 2025 Significant Gravitas. AutoGPT. https://agpt.co,

work page 2025
[11]

Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and Aleksandra Faust

https://agpt.co. Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and Aleksandra Faust. Environment generation for zero-shot compositional reinforcement learning, 2022a. Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Underst...

work page arXiv
[12]

Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914,

work page arXiv
[13]

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov

URL https://proceedings.mlr.press/v162/ humphreys22a.html. Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553,

work page arXiv
[14]

Evaluating language-model agents on realistic autonomous tasks

Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R Lin, Hjalmar Wijk, Joel Burget, et al. Evaluating language-model agents on realistic autonomous tasks. arXiv preprint arXiv:2312.11671,

work page arXiv
[15]

arXiv preprint arXiv:2401.13649 , year=

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649,

work page arXiv
[16]

Autowebglm: Bootstrap and reinforce a large lan- guage model-based web navigating agent

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, et al. Autowebglm: Bootstrap and reinforce a large lan- guage model-based web navigating agent. arXiv preprint arXiv:2404.03648,

work page arXiv
[17]

Benchmarking mo- bile device control agents across diverse configurations

Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, and Kimin Lee. Benchmarking mo- bile device control agents across diverse configurations. In ICLR 2024 Workshop on Generative Models for Decision Making,

work page 2024
[18]

API-Bank: A comprehensive benchmark for Tool-Augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for Tool-Augmented LLMs. April 2023a. Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. A Zero-Shot language agent for computer control with structured reflection. In Houda Bouamor, Juan Pino, and Kalika B...

work page 2023
[19]

arXiv preprint arXiv:2406.03679 , year=

URL https://arxiv.org/abs/ 2406.03679. Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile UI action sequences. In Proc. of the 58th Annual Meeting of the Associa- tion for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 8198–8210. Associa- tion for Computational Linguistics,

work page arXiv 2020
[20]

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang

URLhttps://www.aclweb.org/anthology/ 2020.acl-main.729/. Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In 6th International Conference on Learn- ing Representations (ICLR ’18) , 2018a. URL https://openreview.net/forum?id= ryTp3f-0-. Thomas F. Liu, Mark Craft, Jas...

work page doi:10.1145/3242587.3242650 2020
[21]

URL https://arxiv.org/ abs/2403.08140. OpenAI. GPT-4 technical report,

work page arXiv
[22]

Android in the wild: A large-scale dataset for android device control

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088,

work page arXiv
[23]

From pixels to UI actions: Learning to follow instructions via graphical user interfaces

13 Published as a conference paper at ICLR 2025 Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. May

work page 2025
[24]

Reflexion: Language Agents with Verbal Reinforcement Learning

URL http://proceedings. mlr.press/v70/shi17a.html. Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

David Silver, Aja Huang, Chris J

URL https://arxiv.org/abs/2106.00133. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepe...

work page arXiv
[26]

Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, B ¨orje F

doi: 10.1038/nature16961. Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, B ¨orje F. Karls- son, Bo An, and Zongqing Lu. Towards General Computer Control: A Multimodal Agent For Red Dead Redemption II As A Case Study. arXiv preprint arXiv:2403.03186,

work page doi:10.1038/nature16961
[27]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhu- patiraju, L ´eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram ´e, Johan Fer- ret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Char- line Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Serta...

work page 2025
[28]

Gemma 2: Improving Open Language Models at a Practical Size

URL https://arxiv.org/abs/2408.00118. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko- lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open founda- tion and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan

URL https://arxiv.org/abs/2105.13231. Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan. Ugif: Ui grounded instruction fol- lowing,

work page arXiv
[30]

URL https://arxiv.org/abs/2211.07615. Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Micha ¨el Mathieu, Andrew Dudzik, Juny- oung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P Agapiou, Max Jaderberg, Alexander S Vezhnevets, R´emi L...

work page arXiv
[31]

Enabling conversational interaction with mobile ui using large language models

Bryan Wang, Gang Li, and Yang Li. Enabling conversational interaction with mobile ui using large language models. In Proc. of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23. Association for Computing Machinery, 2023a. ISBN 9781450394215. doi: 10.1145/3544548.3580895. URL https://doi.org/10.1145/3544548.3580895. Guanzhi Wang, Yuqi X...

[32]

Os-copilot: Towards generalist computer agents with self-improvement

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456,

[33]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972,

[34]

Understanding the weakness of large language model agents within a complex android environment

Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, and Zhen Xiao. Understanding the weakness of large language model agents within a complex android environment. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 6061–6072,

[35]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023a. Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 20...

[36]

Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al

URL https://arxiv.org/abs/2404.05719. Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. Ufo: A ui-focused agent for windows os interaction. arXiv preprint arXiv:2402.07939, 2024a. Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen...

[37]

Igniting language intelligence: The hitchhiker’s guide from chain-of-thought reasoning to language agents

Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, et al. Igniting language intelligence: The hitchhiker’s guide from chain-of-thought reasoning to language agents. arXiv preprint arXiv:2311.11797,

[38]

MMInA: Benchmarking multihop multimodal internet agents

Ziniu Zhang, Shulin Tian, Liangyu Chen, and Ziwei Liu. MMInA: Benchmarking multihop multimodal internet agents. arXiv preprint arXiv:2404.09992, 2024e. Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024a. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Z...

[39]

Agentstudio: A toolkit for building general virtual agents, 2024b

Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. Agentstudio: A toolkit for building general virtual agents, 2024b. URL https://arxiv.org/abs/2403.17918. Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer ...

[40]

CLICK":

• Accessibility tree: A raw representation of the accessibility tree. This UI tree provides a detailed snapshot of all UI elements currently displayed on the screen. We utilize an accessibility forwarding app from AndroidEnv (Toyama et al., 2021), which leverages gRPC to transmit the accessibility tree data efficiently to the device.
• UI elements: A ...

[41]

"" 6 7 template = ( 8

was guided by three main factors: use case, popularity, and the need for consistency and reproducibility. Use case and categories. We analyzed popular app categories in app stores, focusing on productivity, communication, and multimedia. Selected apps had to meet criteria such as not requiring a login and storing data locally on the device. Additionall...

[42]

replace” variant requires a more complex sequence of UI interactions (long-press, text selection, deletion, then text entry) compared to the simpler

across different seeds demonstrates how task parameterization fundamentally changes task difficulty. For instance, in the ExpenseAddSingle task, the seed determines which expense category must be selected (see UI in Figure 8). When the seed specifies readily on-screen visible categories (e.g., "Housing", "Social"), the agent can complete the task. Howev...
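The seed-dependent behaviour described above can be illustrated with a minimal sketch. It is not AndroidWorld's actual task code: the function names, the category list beyond "Housing" and "Social", and the assumption about which categories appear without scrolling are all hypothetical and chosen only to show how one task template can yield instances of different difficulty.

```python
import random

# Hypothetical category list; only "Housing" and "Social" are named in the text.
CATEGORIES = ["Housing", "Social", "Food", "Travel", "Health", "Gifts"]

# Assumption for this sketch: these categories are visible without scrolling.
VISIBLE_WITHOUT_SCROLL = {"Housing", "Social"}


def generate_params(seed: int) -> dict:
    """Derive task parameters deterministically from a seed."""
    rng = random.Random(seed)  # private generator, so runs are reproducible
    return {
        "name": f"Expense {rng.randint(1, 99)}",
        "amount": round(rng.uniform(1.0, 500.0), 2),
        "category": rng.choice(CATEGORIES),
    }


def render_instruction(params: dict) -> str:
    """Fill a natural-language template with the sampled parameters."""
    return (
        f"Add an expense named '{params['name']}' of {params['amount']:.2f} "
        f"in the '{params['category']}' category."
    )


def needs_scroll(params: dict) -> bool:
    """True when the sampled category is not in the assumed visible set."""
    return params["category"] not in VISIBLE_WITHOUT_SCROLL


if __name__ == "__main__":
    for seed in range(5):
        params = generate_params(seed)
        print(seed, render_instruction(params), "| scroll:", needs_scroll(params))
```

Because the generator is seeded, the same seed always reproduces the same instruction, while different seeds can change which UI interactions the instance requires.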

[43]

Home" icon 41 B

For handling the select dropdown elements on a screen, it’s not necessary for you to provide completely accurate options right now. The full list of options for these elements will be supplied later.

> Role: ASSISTANT
<AGENT RESPONSE TO ABOVE>

> Role: USER
(Reiteration)
First, reiterate your next target element, its detailed location,...
