iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Andrew Keunwoo Jang; Geronimo Carom; Jing Yu Koh; Lawrence Keunho Jang; Mareks Woodside; Ruslan Salakhutdinov

arxiv: 2606.09764 · v1 · pith:NKEX2ZE2new · submitted 2026-06-08 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-27 17:21 UTC · model grok-4.3

keywords: iOS agents · mobile benchmarks · personalization · multi-app tasks · computer use models · agent evaluation · vision language models · memory tasks

The pith

iOSWorld benchmark shows frontier phone agents reach 52% overall but only 37% on multi-app personal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces iOSWorld to test whether phone agents can reason over a persistent user identity, history, and preferences stored across connected native apps rather than treating each interaction in isolation. It builds 26 iOS apps with linked personal data such as messages, transactions, and travel records, then defines 133 tasks split into single-app tasks, multi-app tasks spanning up to eight apps, and memory tasks that require inferring patterns from that data. Evaluations of frontier and open-source models in vision-only and privileged vision-plus-XML settings put the best result at 52% overall success, dropping to 37% on multi-app work. Added XML input lifts frontier models by as much as 26 percentage points but does not help smaller models. This matters because agents that cannot use personal context on the device will remain limited to generic command following.

Core claim

We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks test one app, multi-app tasks span 2 to 8 apps, and memory and personalization tasks require agents to infer patterns from personal data. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input.

What carries the argument

iOSWorld, the interactive native iOS simulator with 26 apps seeded with connected personal data and 133 tasks that require single-app execution, multi-app coordination, or inference from user history.

If this is right

Current agents still lack reliable multi-app coordination even when given privileged access.
Frontier models gain substantial performance from accessibility-tree input while smaller models do not.
Personalization tasks expose the gap between following isolated instructions and maintaining user-specific memory.
The benchmark supplies a concrete, reproducible testbed for measuring progress on device-native agent intelligence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar simulators could be built for Android or other platforms to check whether the observed limits are platform-specific.
Agents may need explicit long-term memory architectures beyond what current vision-language models provide.
If seeded data patterns prove too regular, real-user variability could lower observed success rates further.
The results suggest prioritizing cross-app data sharing mechanisms in future agent designs.

Load-bearing premise

The seeded personal data and task rubrics across the 26 apps accurately represent the kinds of personalization and memory demands that would appear in real user phone usage.

What would settle it

Testing the same agent configurations on actual iOS devices using real user data and observing whether success rates remain near 52% overall and 37% on multi-app tasks.

Figures

Figures reproduced from arXiv: 2606.09764 by Andrew Keunwoo Jang, Geronimo Carom, Jing Yu Koh, Lawrence Keunho Jang, Mareks Woodside, Ruslan Salakhutdinov.

**Figure 1.** Overview of iOSWorld. 26 purpose-built iOS applications share a single user identity (Jordan Avery) and connected data across apps. The benchmark includes 133 tasks across single-app, multi-app, and memory/personalization categories. iOSWorld is the first dynamic native iOS simulator benchmark built around a user’s personal identity. We built 26 native iOS applications and populated them with connected dat…

**Figure 2.** Jordan Avery’s digital life. 26 iOS apps across 10 domains sharing one identity. We display app names in bold and real-life analogues in italics. Edge thickness represents the number of shared data points; Mail is the primary hub.

**Figure 3.** We visualize a multi-app QuickBite → TeamChat task across two modalities with the same model (Opus 4.6). Vision-only finds Nobu, adds an item, and reaches checkout, but spends the remaining budget trying to toggle payment confirmation and never opens TeamChat (50 steps, score 0.20). Vision+XML places the order, posts a deployment update in #launch-war-room, and checks #general announcements in 22 steps (sc…

**Figure 4.** Pass rates by task category and observation modality across all six models. Vision+XML (blue) outperforms vision-only (gray) for the stronger frontier models. GPT-5.4 Mini and the open-source Qwen3.5 baseline do not show the same benefit from the additional modality.

**Figure 5.** Successful memory trajectory (Opus 4.6, vision+XML, 29 steps, score 1.0). For “Give me a full picture of my finances,” Opus pulls balances from MyBank, checks pending requests in SplitPay, opens the Budget Tracker in CloudDocs, and writes a synthesis spanning five apps in 29 steps.

**Figure 6.** A representative budget-exhausted failure (Opus 4.6, vision+XML, 50 steps, score 0.45). Opus explores CityRide (step 3), finds Mail receipts (step 17), reaches MyBank transactions (step 24), but exhausts the 50-step budget before completing data entry in CloudSheets. Budget-exhausted runs account for 51% of frontier-model failures.

**Figure 7.** dinespot-001, Qwen3.5 vision-only. Top, structured MCP tools (17 steps, score 1.0). Typed calls take the agent directly to a confirmed Harborline Seafood booking. Bottom, cookbook mobile use (50 steps, score 0.25). The same model gets stuck on the filter sheet and never makes a reservation.

| Action | Opus | Sonnet | GPT-5.4 | Mini | Gemini | Qwen3.5 |
|---|---|---|---|---|---|---|
| tap xy | 62% | 62% | 55% | 37% | 54% | 66% |
| swipe | 21% | 23% | 12% | 35% | 10% | 13% |
| type | 9% | 8… | | | | |

**Figure 8.** Vision-only vs. vision+XML accuracy per model. Privileged vision+XML access …

**Figure 9.** Cumulative pass rate vs. step budget. Left: overall. Right: by task category. Solid: vision+XML; dashed: vision-only. Single-app saturates by step 20; multi-app scales through step 40; memory shows varied scaling with Opus climbing steeply past step 30.

**Figure 10.** GPT-5.4 Mini on the same DineSpot reservation task under both modalities. Left: vision-only navigates cleanly to a confirmed booking in 24 steps (score 1.0). Right: vision+XML loops through filter menus for 30 steps and ultimately forgets the original goal (score 0.4, 37 steps). The accessibility tree overwhelms the smaller model’s context capacity.

**Figure 11.** Qwen3.5 35B-A3B (vision+XML) on a simple “Set a 6:45 AM alarm labeled Gym” single-app task. The agent reaches the Add Alarm screen by step 5 but then issues the same swipe-down action on the time picker 38 consecutive times (steps 6–46), never adjusting to 6:45, never setting the label, and never tapping Save. The 50-step budget is exhausted on a task Opus and Sonnet both solve in 25 steps. Repeated-actio…

**Figure 12.** Successful single-app Notes task. Team Standup Notes, add bug-fix bullet (Sonnet …

**Figure 13.** Successful multi-app DineSpot → TeamChat task (Opus 4.6, vision+XML, 26 steps).

**Figure 14.** Failed multi-app SkyTrip → CityRide → Clock → QuickChat task. The agent completed 3/4 subtasks but ran out of budget before messaging (50 steps, score 0.56). Panels: SkyTrip, flight details (step 3); CityRide, ride booked (step 22); Clock, alarm set (step 46); QuickChat, budget exhausted (step 50). Task: "Check my SkyTrip departure time, request a CityRide Black to SFO, set an alarm 30 min before departure, and message Maya Patel in QuickChat with details."

**Figure 15.** Failed memory Notes → QuickChat → MegaMart → DineSpot task. The agent found birthday info and bought a gift but ran out of budget before the dinner reservation (50 steps, score 0.50).
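
The cumulative pass rate curve described in the Figure 9 caption can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the `Run` record, its field names, and the demo values are hypothetical, and it assumes a run counts as solved within budget b if it passed using at most b steps. The 50-step cap matches the budget quoted in the captions.

```python
from dataclasses import dataclass


@dataclass
class Run:
    steps_used: int  # steps taken before the episode ended (hypothetical field)
    passed: bool     # whether the rubric marked the run as a pass (hypothetical field)


def cumulative_pass_rate(runs, max_budget=50):
    """Fraction of runs that pass within each step budget 1..max_budget."""
    if not runs:
        return []
    curve = []
    for budget in range(1, max_budget + 1):
        solved = sum(1 for r in runs if r.passed and r.steps_used <= budget)
        curve.append(solved / len(runs))
    return curve


# Hypothetical runs for illustration only; these are not results from the paper.
demo_runs = [Run(17, True), Run(22, True), Run(29, True), Run(50, False)]
curve = cumulative_pass_rate(demo_runs)
print(curve[19], curve[29], curve[49])  # pass rate at budgets 20, 30, 50
```

A curve that is still rising at the largest budget, as the caption reports for multi-app tasks through step 40, suggests that some failures are budget-limited rather than capability-limited.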

Original abstract

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
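
The category sizes and headline rates in the abstract support a small piece of arithmetic. The sketch below assumes that the 52% overall figure is a task-weighted average over all 133 tasks and that the 37% multi-app figure comes from the same configuration; the abstract does not state either point explicitly, and the function name is illustrative.

```python
# Category sizes as reported in the abstract.
CATEGORY_COUNTS = {"single_app": 27, "multi_app": 60, "memory": 46}


def implied_rate_outside(category, overall_rate, category_rate, counts=CATEGORY_COUNTS):
    """Pass rate implied for all tasks outside `category`.

    Assumption (not confirmed by the source): `overall_rate` is a
    task-weighted average across every task in `counts`.
    """
    total = sum(counts.values())
    passed_overall = overall_rate * total
    passed_in_category = category_rate * counts[category]
    return (passed_overall - passed_in_category) / (total - counts[category])


total_tasks = sum(CATEGORY_COUNTS.values())
rest = implied_rate_outside("multi_app", overall_rate=0.52, category_rate=0.37)
print(f"{total_tasks} tasks; implied pass rate outside multi-app: {rest:.1%}")
```

Under those assumptions, the single-app and memory tasks taken together would sit at roughly 64%, which gives a sense of how much the 60 multi-app tasks pull the overall number down.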

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

iOSWorld is a solid first benchmark for iOS agents with persistent personal data across apps, but the synthetic tasks and rubrics need checking against real usage patterns.

The paper's main contribution is releasing iOSWorld, the first interactive native iOS simulator benchmark built around one persistent user identity with connected data across 26 custom apps. It includes 133 tasks in three categories: single-app, multi-app, and memory/personalization ones that require inferring from personal records like transactions and messages.

The release itself is done right. Everything is open source (apps, seeded data, tasks, rubrics, and evaluation code), which lets others build on it directly. The model results give a clear baseline: the best configuration hits 52% overall but only 37% on multi-app tasks, and adding XML input lifts frontier models by up to 26 percentage points while smaller models do not benefit. That highlights where current agents fall short on cross-app and memory work.

The soft spot is whether the seeded synthetic data and author-defined tasks actually test the personalization and memory demands that matter in practice. The patterns in the data are constructed rather than drawn from real device logs, so the difficulty of the inference steps or the noise level might not match everyday phone use. If the connections between apps or the required recall are cleaner or simpler than typical cases, the gap between vision-only and privileged settings could overstate or understate real progress. More detail on task validation and rubric reliability would strengthen the claim.

This is aimed at researchers building or benchmarking mobile agents. It shows straightforward engagement with the problem and supplies a usable resource. I would bring it to a reading group to talk through benchmark design choices. It deserves peer review because a new benchmark with this scope is worth referee time even if the tasks need some refinement.

Referee Report

3 major / 2 minor

Summary. The paper introduces iOSWorld, the first interactive native iOS simulator benchmark for personally intelligent phone agents. It features 26 newly built apps containing connected personal data (transactions, messages, travel records, etc.) and 133 tasks divided into single-app (27), multi-app (60), and memory/personalization (46) categories. Frontier and open-source computer-use models are evaluated in vision-only and privileged vision+XML settings, with the best configuration achieving 52% overall success (37% on multi-app tasks) and privileged access improving frontier models by up to 26 percentage points. The benchmark, including all apps, seeded data, tasks, rubrics, and evaluation code, is released open-source.

Significance. If the seeded data and rubrics validly exercise personalization and memory demands, iOSWorld would be a valuable open benchmark addressing a clear gap in existing mobile-agent evaluations. The reported performance gaps (especially multi-app and the differential benefit of XML access) provide concrete, falsifiable baselines for future work. The full release of apps, data, tasks, rubrics, and code is a notable strength for reproducibility.

major comments (3)

[Task construction and validation sections] Task construction and validation (likely §3–4): the manuscript provides no details on how the 133 tasks or rubrics were validated, inter-rater reliability of scoring, or whether data seeding avoids leakage between training corpora and the benchmark; without this, the headline claim that the 46 memory/personalization tasks test inference over personal history remains unanchored.
[Data seeding and task design] Ecological validity of seeded data (§3): the central interpretation that privileged XML access advances 'personally intelligent' agents rests on the assumption that the synthetic connected data (transactions, messages, etc.) across 26 apps matches the frequency, noise, and inference difficulty of real device usage; the paper should include a concrete comparison or user study to support this.
[Task categories] Category definitions and scoring (Table 1 or equivalent): the distinction between multi-app and memory/personalization tasks is load-bearing for the 37% multi-app result; the manuscript should clarify overlap and whether any memory tasks can be solved without cross-app reasoning.

minor comments (2)

[Evaluation setup] Figure clarity: screenshots of the simulator and XML structure would help readers understand the privileged vs. vision-only conditions.
[Introduction] Missing reference: cite prior mobile-agent benchmarks (e.g., AndroidWorld or similar) when claiming novelty.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive and detailed comments. We respond to each major comment below and note planned revisions to the manuscript.

Referee: [Task construction and validation sections] Task construction and validation (likely §3–4): the manuscript provides no details on how the 133 tasks or rubrics were validated, inter-rater reliability of scoring, or whether data seeding avoids leakage between training corpora and the benchmark; without this, the headline claim that the 46 memory/personalization tasks test inference over personal history remains unanchored.

Authors: We agree that the original manuscript lacks sufficient detail on task and rubric construction. In the revision we will add an expanded subsection in §3 describing the iterative design process for the 133 tasks, the criteria used to ensure memory/personalization tasks require inference over seeded personal history, and the rubric development workflow. We did not conduct formal inter-rater reliability measurements or explicit training-corpus leakage audits; these will be noted as limitations.

Revision: partial

Referee: [Data seeding and task design] Ecological validity of seeded data (§3): the central interpretation that privileged XML access advances 'personally intelligent' agents rests on the assumption that the synthetic connected data (transactions, messages, etc.) across 26 apps matches the frequency, noise, and inference difficulty of real device usage; the paper should include a concrete comparison or user study to support this.

Authors: We acknowledge that a direct empirical comparison or user study would strengthen claims about ecological validity. The seeded data was constructed to reflect common real-world patterns (e.g., transaction frequencies, message threading, travel itineraries) drawn from public iOS app documentation and typical usage scenarios, with explicit cross-app linkages. A controlled user study or quantitative distributional comparison lies outside the scope of the current work; we will expand the data-seeding description in §3 and add an explicit limitations paragraph.

Revision: partial

Referee: [Task categories] Category definitions and scoring (Table 1 or equivalent): the distinction between multi-app and memory/personalization tasks is load-bearing for the 37% multi-app result; the manuscript should clarify overlap and whether any memory tasks can be solved without cross-app reasoning.

Authors: We will revise the category definitions and Table 1 to make the distinctions explicit. Memory/personalization tasks are defined by the requirement to perform inference over a user's personal history or preferences; multi-app tasks are defined by the number of distinct apps that must be accessed. We will add a paragraph clarifying potential overlap, provide examples of memory tasks that can be solved within a single app, and note that the 37% multi-app figure reflects tasks whose primary difficulty is cross-app coordination rather than memory inference.

Revision: yes
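
The category definitions in the last response can be made concrete with a small sketch. Everything here is hypothetical: the `Task` record, the `needs_history_inference` flag, the demo task IDs, and the rule that memory takes precedence over multi-app are assumptions for illustration, not the benchmark's schema. The app sequences for the airport and birthday demos are borrowed from the Figure 14 and Figure 15 captions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Task:
    task_id: str                   # hypothetical identifier
    apps: Tuple[str, ...]          # apps the task must touch
    needs_history_inference: bool  # hypothetical flag for memory tasks


def primary_category(task: Task) -> str:
    """Assign one category; memory precedence is an assumption."""
    if task.needs_history_inference:
        return "memory"
    if len(set(task.apps)) >= 2:
        return "multi_app"
    return "single_app"


def is_cross_app(task: Task) -> bool:
    """True when the task touches at least two distinct apps."""
    return len(set(task.apps)) >= 2


tasks = [
    Task("alarm-demo", ("Clock",), False),
    Task("airport-demo", ("SkyTrip", "CityRide", "Clock", "QuickChat"), False),
    Task("birthday-demo", ("Notes", "QuickChat", "MegaMart", "DineSpot"), True),
    Task("notes-recall-demo", ("Notes",), True),
]
for t in tasks:
    span = "cross-app" if is_cross_app(t) else "one app"
    print(t.task_id, primary_category(t), span)
```

Tracking the cross-app property separately from the primary category is one way to report the overlap the referee asks about: a memory task can then be counted as either cross-app or single-app without changing its category.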

standing simulated objections not resolved

Formal inter-rater reliability assessment for task scoring
Empirical user study or quantitative comparison validating ecological validity of the synthetic seeded data
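
For the first unresolved objection, one common agreement statistic is Cohen's kappa, which compares observed agreement between two raters with the agreement expected by chance. The sketch below is a plain-Python illustration with hypothetical pass/fail labels; the paper reports no such measurement, and nothing here reflects its rubric tooling.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgments on six runs; not data from the paper.
rater_a = ["pass", "pass", "fail", "fail", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "fail", "pass", "fail"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Because the captions show fractional rubric scores such as 0.56, a real assessment would likely need either a pass threshold applied first or a statistic suited to continuous scores; that choice is left open here.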

Circularity Check

0 steps flagged

No circularity: benchmark release with empirical evaluation only

The paper introduces iOSWorld as a new benchmark consisting of 26 apps, seeded personal data, 133 tasks, and rubrics. It reports empirical model performance (e.g., 52% overall) on these tasks under vision-only vs. privileged settings. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present that reduce any claimed result to the inputs by construction. The contribution is the benchmark definition and initial scores; validity concerns about whether the seeded data matches real usage are separate from circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that the constructed apps and seeded data form a representative test of personalization; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The seeded data in the 26 apps (transactions, messages, travel records, social relationships, financial activity) represents realistic user personal information patterns.
This assumption underpins the memory and personalization task category.

pith-pipeline@v0.9.1-grok · 5771 in / 1158 out tokens · 16360 ms · 2026-06-27T17:21:52.136929+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
cs.AI 2026-06 unverdicted novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

Reference graph

Works this paper leans on

47 extracted references · 3 linked inside Pith · cited by 1 Pith paper

[1] LaMP: When Large Language Models Meet Personalization. ACL.
[2] Evaluating Very Long-Term Conversational Memory of LLM Agents. ACL.
[3] MemoryBank: Enhancing Large Language Models with Long-Term Memory. AAAI.
[4] Generative Agents: Interactive Simulacra of Human Behavior. UIST.
[5] Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence.
[6] World of Bits: An Open-Domain Platform for Web-Based Agents. ICML.
[7] Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. ICLR.
[8] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. NeurIPS.
[9] Mind2Web: Towards a Generalist Agent for the Web. NeurIPS.
[10] WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR.
[11] VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. ACL.
[12] WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL.
[13] An Illusion of Progress? Assessing the Current State of Web Agents. COLM.
[14] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv preprint arXiv:2504.12516.
[15] Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. arXiv preprint arXiv:2310.11441.
[16] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments. 2025.
[17] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS.
[18] Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. ICML.
[19] macOSWorld: A Multilingual Interactive Benchmark for GUI Agents. NeurIPS.
[20] WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? ICML.
[21] GAIA: A benchmark for General AI Assistants. ICLR.
[22] TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. NeurIPS.
[23] τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. ICLR.
[24] AndroidEnv: A Reinforcement Learning Platform for Android. arXiv preprint arXiv:2105.13231.
[25] Android in the Wild: A Large-Scale Dataset for Android Device Control. NeurIPS.
[26] AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. ICLR.
[27] Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction. arXiv preprint arXiv:2305.08144.
[28] AppAgent: Multimodal Agents as Smartphone Users. CHI.
[29] CogAgent: A Visual Language Model for GUI Agents. CVPR.
[30] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. ICLR.
[31] AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents. ACL.
[32] SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation. ICLR.
[33] B-MoCA: Benchmarking Mobile Device Control Agents across Diverse Configurations. CoLLAs.
[34] UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326.
[35] AutoDroid: LLM-powered Task Automation in Android. MobiCom.
[36] DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. NeurIPS.
[37] Digi-Q: Learning VLM Q-Value Functions for Training Device-Control Agents. ICLR.
[38] GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. ICCV.
[39] Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. ECCV.
[40] SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents. 2024.
[41] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023.
[42] Claude Opus 4.6. 2026.
[43] Claude Code. 2026.
[44] GPT-5.4. 2026.
[45] Gemini 3 Flash. 2026.
[46] Qwen3.5-35B-A3B. 2026.
[47] SkillBot: Towards Automatic Skill Development via User Demonstration. ACL.
