arxiv: 2605.07630 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI· cs.LG

Recognition: no theorem link

Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Zhengyang Tang , Yi Zhang , Chenxin Li , Xin Lai , Pengyuan Lyu , Yiduo Guo , Weinong Wang , Junyi Li

show 13 more authors

Yang Ding Huawen Shen Zhengyao Fang Xingran Zhou Liang Wu Fei Tang Sunqi Fan Shangpin Peng Zheng Ruan Anran Zhang Benyou Wang Chengquan Zhang Han Hu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:57 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords phone-use agentssafety evaluationAI agentsbenchmarksLLM safetyrisk assessmentagent capability

0 comments

The pith

A harmless outcome from a phone-use agent does not prove safety, as it may reflect inability to act rather than safe judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing safety evaluations for phone-use agents cannot distinguish between avoiding harm through deliberate safe choices and simply failing to understand the screen or execute any action. This conflation leads benchmarks to mislabel incapability as safety. To address it, the authors created PhoneSafety, a benchmark of 700 safety-critical moments drawn from real interactions across more than 130 apps. Each moment isolates the next decision and classifies the outcome as safe action, unsafe action, or failure to act usefully. Testing eight agents reveals that stronger general performance does not predict safer choices, and failures cluster in complex screens as a capability issue rather than a safety one.

Core claim

The central claim is that phone-use agents show two distinct patterns at risky moments: unsafe choices when they can act but select wrongly, and inability to act in visually or operationally demanding contexts. Stronger agents on ordinary tasks are not reliably safer at these points. Failures to act behave like a capability signal, remaining stable across evaluation changes and concentrated in harder settings. Therefore, a harmless outcome alone is insufficient evidence of safety; evaluations must separate unsafe judgment from inability to act.

What carries the argument

The PhoneSafety benchmark, which isolates each safety-critical moment from real phone interactions to ask whether the model takes the safe action, the unsafe action, or fails to do anything useful.

Load-bearing premise

That the 700 selected moments truly isolate a single next decision and that human or automated labeling of actions as safe, unsafe, or failure is reliable and not itself influenced by model capability.

What would settle it

Re-running the benchmark on a larger set of models while controlling for screen complexity and measuring whether high-capability models still choose unsafe actions when they can act at all.

Figures

Figures reproduced from arXiv: 2605.07630 by Anran Zhang, Benyou Wang, Chengquan Zhang, Chenxin Li, Fei Tang, Han Hu, Huawen Shen, Junyi Li, Liang Wu, Pengyuan Lyu, Shangpin Peng, Sunqi Fan, Weinong Wang, Xingran Zhou, Xin Lai, Yang Ding, Yiduo Guo, Yi Zhang, Zheng Ruan, Zhengyang Tang, Zhengyao Fang.

**Figure 2.** Figure 2: Construction of PHONESAFETY. Structured realistic tasks are executed on real Android devices to produce 4,512 trajectories (∼75K steps, 130+ apps), from which safety-critical moments are screened, validated, and annotated with protocol-grounded safe and unsafe reference behaviors to form the final 700-case evaluation set [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Summary of main empirical findings. (a) General capability vs. safe-action rate. (b) Failure [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper separates safety from capability in phone agent evals but labeling independence needs checking.

read the letter

The key thing to know is that this paper argues we need to stop treating harmless outcomes as proof of safety in phone agents, since many times the agent simply couldn't act. They created the PhoneSafety benchmark of 700 isolated risky moments from real phone use across lots of apps. For each, they check if the agent picks the safe option, the unsafe one, or fails to do anything relevant. What stands out is the evaluation of eight agents showing that stronger general phone skills don't always mean better safety judgments. Also, the cases where agents do nothing useful tend to happen on harder screens and act as signs of limited capability rather than caution. This separation of judgment from execution is the main new angle. The paper does a decent job making the case with concrete examples and consistent patterns across models. The stability when they tweak the evaluation protocol is a positive sign. The soft spot is the reliability of splitting actions into those three categories. If how you decide what the agent did depends on the model's output quality, then weaker models might get pushed into the failure bucket more easily, which could make the capability-signal finding partly circular. They say the patterns hold up under protocol changes, but without specifics on how they selected the 700 moments or validated the labels, it's hard to judge how clean the split really is. This is for folks working on agent benchmarks or safety for real-world tools like phone assistants. Readers interested in evaluation methods will get something out of the two patterns and the benchmark construction. It deserves peer review because the core problem it identifies is important and the approach is a step in the right direction. I'd recommend sending it for review, focusing on getting more transparency around the annotation and selection process.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and classifies agent behavior into one of three categories: safe action taken, unsafe action taken, or failure to do anything useful. Evaluation of eight representative phone-use agents reveals two patterns: stronger general phone-use ability does not reliably imply safer choices at risky moments, and failures to act usefully behave like a capability signal (concentrated in visually and operationally demanding screens) rather than a safety signal. The central conclusion is that a harmless outcome is insufficient evidence of safety and that evaluations must separate unsafe judgment from inability to act.

Significance. If the three-way categorization proves robust, the work provides a valuable empirical contribution by constructing a new benchmark that disentangles distinct failure modes in phone-use agents. This addresses a genuine gap in existing evaluations that often conflate task success or refusal with safety. The use of real interactions from diverse apps and the observation of stable patterns across protocol changes are strengths, offering concrete, falsifiable distinctions between capability and safety that could inform future agent design and benchmarking.

major comments (2)

[Evaluation methodology] Abstract and evaluation methodology: The central claim that failures act as a capability signal (rather than safety signal) and that stronger general ability does not imply safer choices rests on reliable three-way categorization of actions at each of the 700 moments. However, the manuscript provides no details on instance selection criteria, how actions are elicited and mapped to categories, inter-annotator agreement for human labeling, or statistical tests supporting the observed patterns. This is load-bearing because, as the skeptic concern notes, more capable models may produce clearer outputs that are easier to label as safe/unsafe while weaker models default to 'failure,' potentially rendering the separation partly tautological.
[Results and patterns] Abstract: The reported stability of failure patterns 'when the evaluation protocol changes' is presented as supporting evidence that failures are capability-driven, but without quantitative details on what protocol variations were tested, how many instances were affected, or controls for labeling consistency, it is difficult to assess whether this rules out confounding from model-specific output styles.

minor comments (2)

The abstract states evaluation on 'eight representative phone-use agents' but does not name them or their capability baselines; adding this in the main text would improve reproducibility.
Minor presentation: The phrase 'remain stable when the evaluation protocol changes' could be clarified with a brief parenthetical example of one such change to aid reader understanding without requiring the full methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving methodological transparency, which we will address in the revision. We respond to each major comment below.

read point-by-point responses

Referee: [Evaluation methodology] Abstract and evaluation methodology: The central claim that failures act as a capability signal (rather than safety signal) and that stronger general ability does not imply safer choices rests on reliable three-way categorization of actions at each of the 700 moments. However, the manuscript provides no details on instance selection criteria, how actions are elicited and mapped to categories, inter-annotator agreement for human labeling, or statistical tests supporting the observed patterns. This is load-bearing because, as the skeptic concern notes, more capable models may produce clearer outputs that are easier to label as safe/unsafe while weaker models default to 'failure,' potentially rendering the separation partly tautological.

Authors: We agree that the current manuscript would benefit from greater methodological detail to support the central claims. In the revised version, we will expand the Evaluation section with: explicit criteria for selecting the 700 instances from real interactions (diversity across 130+ apps and risk types), the standardized protocol for eliciting and parsing agent actions, the precise mapping rules for the three categories, inter-annotator agreement metrics for the human labeling, and statistical tests (including significance testing for the concentration of failures in demanding screens). Regarding the potential tautology concern, the human categorization relies on objective semantic assessment of whether an action addresses the risk or is useful, independent of output clarity; results show capable models still select unsafe actions in some cases while weaker models occasionally select safe ones, indicating the distinction is substantive rather than artifactual. revision: yes
Referee: [Results and patterns] Abstract: The reported stability of failure patterns 'when the evaluation protocol changes' is presented as supporting evidence that failures are capability-driven, but without quantitative details on what protocol variations were tested, how many instances were affected, or controls for labeling consistency, it is difficult to assess whether this rules out confounding from model-specific output styles.

Authors: We will add quantitative details on the protocol stability analysis in the revision. The variations tested included two alternative prompt phrasings and output formatting requirements applied to all 700 instances. We will report the exact number of instances with category changes (under 8%) and include controls such as a standardized post-processing parser to normalize outputs before labeling. These additions will clarify that the stability is not attributable to model-specific output styles and further support the interpretation of failures as capability signals. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction

full rationale

The paper introduces PhoneSafety as a new benchmark of 700 safety-critical moments drawn from real phone interactions, then evaluates eight agents by categorizing outcomes into safe action, unsafe action, or failure. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. All claims rest on fresh data collection and direct observation rather than any reduction of results to prior inputs, self-citations, or ansatzes. The separation of safety from capability is presented as an empirical finding from the new protocol, not a definitional or fitted tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no free parameters, no invented physical or mathematical entities, and relies on the domain assumption that safety-critical moments can be isolated from phone trajectories and labeled into three mutually exclusive categories.

axioms (1)

domain assumption Safety-critical moments can be isolated from real phone interaction trajectories such that the next action decision is unambiguous.
This premise underpins the entire benchmark construction described in the abstract.

pith-pipeline@v0.9.0 · 5659 in / 1181 out tokens · 30988 ms · 2026-05-11T01:57:16.993156+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

[1]

Agentharm: A benchmark for measuring harmfulness of llm agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. InInternational Conference on Learning Representations, volume 2025, pages 79185–79220,

work page 2025
[2]

Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333, 2025

URL https://www-cdn.anthropic.com/ 0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf. System card. Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, and Yingchun Wang. Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333,

work page arXiv
[3]

Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820, 2024

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820,

work page arXiv
[4]

Zero-permission manipulation: Can we trust large multimodal model powered gui agents?arXiv preprint arXiv:2601.12349, 2026

Yi Qian, Kunwei Qian, Xingbang He, Ligeng Chen, Jikang Zhang, Tiantai Zhang, Haiyang Wei, Linzhang Wang, Hao Wu, and Bing Mao. Zero-permission manipulation: Can we trust large multimodal model powered gui agents?arXiv preprint arXiv:2601.12349,

work page arXiv
[5]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326,

work page Pith review arXiv
[6]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573,

work page internal anchor Pith review arXiv
[7]

Towards trustworthy gui agents: A survey.arXiv preprint arXiv:2503.23434, 2025

Yucheng Shi, Wenhao Yu, Wenlin Yao, Wenhu Chen, and Ninghao Liu. Towards trustworthy GUI agents: A survey.CoRR, abs/2503.23434,

work page arXiv
[9]

Gemini: A Family of Highly Capable Multimodal Models

URL https: //github.com/stepfun-ai/gelab-zero. Gemini Team. Gemini: A family of highly capable multimodal models.CoRR, abs/2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Gemini: A Family of Highly Capable Multimodal Models

doi: 10.48550/ARXIV .2312.11805. URLhttps://doi.org/10.48550/arXiv.2312.11805. 10 Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[11]

Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134,

Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, and Maarten Sap. Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134,

work page arXiv
[12]

GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents, jan 2026

Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, and Jiyan He. Guiguard: Toward a general framework for privacy-preserving gui agents. arXiv preprint arXiv:2601.18842,

work page internal anchor Pith review arXiv
[13]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Amir Globersons, Lester Mackey,...

work page 2024
[14]

URL http://papers.nips.cc/paper_files/paper/2024/hash/ 5d413e48f84dc61244b6be550f1cd8f5-Abstract-Datasets_and_Benchmarks_Track. html. Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855,

work page arXiv 2024
[15]

Riosworld: Benchmarking the risk of multimodal computer-use agents.arXiv preprint arXiv:2506.00618, 2025

Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents.arXiv preprint arXiv:2506.00618,

work page arXiv
[16]

Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047,

work page arXiv
[17]

Benchmarks such as MobileSafetyBench Lee et al

11 A Related Work Recent work has made clear that mobile and computer-use agents face safety risks that differ from those of text-only assistants. Benchmarks such as MobileSafetyBench Lee et al. [2026], GhostEI- Bench Chen et al. [2025], RiOSWorld Yang et al. [2025], and GUIGuard Wang et al. [2026], together with attack studies on environmental injection,...

work page 2026