Recognition: no theorem link
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
Pith reviewed 2026-05-11 01:57 UTC · model grok-4.3
The pith
A harmless outcome from a phone-use agent does not prove safety, as it may reflect inability to act rather than safe judgment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that phone-use agents show two distinct patterns at risky moments: unsafe choices when they can act but select wrongly, and inability to act in visually or operationally demanding contexts. Stronger agents on ordinary tasks are not reliably safer at these points. Failures to act behave like a capability signal, remaining stable across evaluation changes and concentrated in harder settings. Therefore, a harmless outcome alone is insufficient evidence of safety; evaluations must separate unsafe judgment from inability to act.
What carries the argument
The PhoneSafety benchmark, which isolates each safety-critical moment from real phone interactions to ask whether the model takes the safe action, the unsafe action, or fails to do anything useful.
Load-bearing premise
That the 700 selected moments truly isolate a single next decision and that human or automated labeling of actions as safe, unsafe, or failure is reliable and not itself influenced by model capability.
What would settle it
Re-running the benchmark on a larger set of models while controlling for screen complexity and measuring whether high-capability models still choose unsafe actions when they can act at all.
Figures
read the original abstract
When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main patterns. First, stronger general phone-use ability does not reliably imply safer choices at risky moments. Models that perform better on ordinary app tasks are not always the ones that behave more safely when the next action matters. Second, failures to do anything useful behave like a capability signal rather than a safety signal: they are concentrated in more visually and operationally demanding settings and remain stable when the evaluation protocol changes. Across models, failures split into two recurring patterns: unsafe choices in settings where the model can act but chooses wrongly, and inability to act in more visually and operationally demanding screens. Overall, a harmless outcome is not enough to count as evidence of safety. Evaluating phone-use agents requires separating unsafe judgment from inability to act.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and classifies agent behavior into one of three categories: safe action taken, unsafe action taken, or failure to do anything useful. Evaluation of eight representative phone-use agents reveals two patterns: stronger general phone-use ability does not reliably imply safer choices at risky moments, and failures to act usefully behave like a capability signal (concentrated in visually and operationally demanding screens) rather than a safety signal. The central conclusion is that a harmless outcome is insufficient evidence of safety and that evaluations must separate unsafe judgment from inability to act.
Significance. If the three-way categorization proves robust, the work provides a valuable empirical contribution by constructing a new benchmark that disentangles distinct failure modes in phone-use agents. This addresses a genuine gap in existing evaluations that often conflate task success or refusal with safety. The use of real interactions from diverse apps and the observation of stable patterns across protocol changes are strengths, offering concrete, falsifiable distinctions between capability and safety that could inform future agent design and benchmarking.
major comments (2)
- [Evaluation methodology] Abstract and evaluation methodology: The central claim that failures act as a capability signal (rather than safety signal) and that stronger general ability does not imply safer choices rests on reliable three-way categorization of actions at each of the 700 moments. However, the manuscript provides no details on instance selection criteria, how actions are elicited and mapped to categories, inter-annotator agreement for human labeling, or statistical tests supporting the observed patterns. This is load-bearing because, as the skeptic concern notes, more capable models may produce clearer outputs that are easier to label as safe/unsafe while weaker models default to 'failure,' potentially rendering the separation partly tautological.
- [Results and patterns] Abstract: The reported stability of failure patterns 'when the evaluation protocol changes' is presented as supporting evidence that failures are capability-driven, but without quantitative details on what protocol variations were tested, how many instances were affected, or controls for labeling consistency, it is difficult to assess whether this rules out confounding from model-specific output styles.
minor comments (2)
- The abstract states evaluation on 'eight representative phone-use agents' but does not name them or their capability baselines; adding this in the main text would improve reproducibility.
- Minor presentation: The phrase 'remain stable when the evaluation protocol changes' could be clarified with a brief parenthetical example of one such change to aid reader understanding without requiring the full methods section.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving methodological transparency, which we will address in the revision. We respond to each major comment below.
read point-by-point responses
-
Referee: [Evaluation methodology] Abstract and evaluation methodology: The central claim that failures act as a capability signal (rather than safety signal) and that stronger general ability does not imply safer choices rests on reliable three-way categorization of actions at each of the 700 moments. However, the manuscript provides no details on instance selection criteria, how actions are elicited and mapped to categories, inter-annotator agreement for human labeling, or statistical tests supporting the observed patterns. This is load-bearing because, as the skeptic concern notes, more capable models may produce clearer outputs that are easier to label as safe/unsafe while weaker models default to 'failure,' potentially rendering the separation partly tautological.
Authors: We agree that the current manuscript would benefit from greater methodological detail to support the central claims. In the revised version, we will expand the Evaluation section with: explicit criteria for selecting the 700 instances from real interactions (diversity across 130+ apps and risk types), the standardized protocol for eliciting and parsing agent actions, the precise mapping rules for the three categories, inter-annotator agreement metrics for the human labeling, and statistical tests (including significance testing for the concentration of failures in demanding screens). Regarding the potential tautology concern, the human categorization relies on objective semantic assessment of whether an action addresses the risk or is useful, independent of output clarity; results show capable models still select unsafe actions in some cases while weaker models occasionally select safe ones, indicating the distinction is substantive rather than artifactual. revision: yes
-
Referee: [Results and patterns] Abstract: The reported stability of failure patterns 'when the evaluation protocol changes' is presented as supporting evidence that failures are capability-driven, but without quantitative details on what protocol variations were tested, how many instances were affected, or controls for labeling consistency, it is difficult to assess whether this rules out confounding from model-specific output styles.
Authors: We will add quantitative details on the protocol stability analysis in the revision. The variations tested included two alternative prompt phrasings and output formatting requirements applied to all 700 instances. We will report the exact number of instances with category changes (under 8%) and include controls such as a standardized post-processing parser to normalize outputs before labeling. These additions will clarify that the stability is not attributable to model-specific output styles and further support the interpretation of failures as capability signals. revision: yes
Circularity Check
No circularity: purely empirical benchmark construction
full rationale
The paper introduces PhoneSafety as a new benchmark of 700 safety-critical moments drawn from real phone interactions, then evaluates eight agents by categorizing outcomes into safe action, unsafe action, or failure. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. All claims rest on fresh data collection and direct observation rather than any reduction of results to prior inputs, self-citations, or ansatzes. The separation of safety from capability is presented as an empirical finding from the new protocol, not a definitional or fitted tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Safety-critical moments can be isolated from real phone interaction trajectories such that the next action decision is unambiguous.
Reference graph
Works this paper leans on
-
[1]
Agentharm: A benchmark for measuring harmfulness of llm agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. InInternational Conference on Learning Representations, volume 2025, pages 79185–79220,
work page 2025
-
[2]
URL https://www-cdn.anthropic.com/ 0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf. System card. Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, and Yingchun Wang. Ghostei-bench: Do mobile agents resilience to environmental injection in dynamic on-device environments?arXiv preprint arXiv:2510.20333,
-
[3]
Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820, 2024
Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820,
-
[4]
Yi Qian, Kunwei Qian, Xingbang He, Ligeng Chen, Jikang Zhang, Tiantai Zhang, Haiyang Wei, Linzhang Wang, Hao Wu, and Bing Mao. Zero-permission manipulation: Can we trust large multimodal model powered gui agents?arXiv preprint arXiv:2601.12349,
-
[5]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326,
-
[6]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573,
work page internal anchor Pith review arXiv
-
[7]
Towards trustworthy gui agents: A survey.arXiv preprint arXiv:2503.23434, 2025
Yucheng Shi, Wenhao Yu, Wenlin Yao, Wenhu Chen, and Ninghao Liu. Towards trustworthy GUI agents: A survey.CoRR, abs/2503.23434,
-
[9]
Gemini: A Family of Highly Capable Multimodal Models
URL https: //github.com/stepfun-ai/gelab-zero. Gemini Team. Gemini: A family of highly capable multimodal models.CoRR, abs/2312.11805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Gemini: A Family of Highly Capable Multimodal Models
doi: 10.48550/ARXIV .2312.11805. URLhttps://doi.org/10.48550/arXiv.2312.11805. 10 Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
-
[11]
Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, and Maarten Sap. Openagentsafety: A comprehensive framework for evaluating real-world ai agent safety.arXiv preprint arXiv:2507.06134,
-
[12]
GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents, jan 2026
Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, and Jiyan He. Guiguard: Toward a general framework for privacy-preserving gui agents. arXiv preprint arXiv:2601.18842,
work page internal anchor Pith review arXiv
-
[13]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Amir Globersons, Lester Mackey,...
work page 2024
-
[14]
URL http://papers.nips.cc/paper_files/paper/2024/hash/ 5d413e48f84dc61244b6be550f1cd8f5-Abstract-Datasets_and_Benchmarks_Track. html. Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855,
-
[15]
Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents.arXiv preprint arXiv:2506.00618,
-
[16]
Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047,
-
[17]
Benchmarks such as MobileSafetyBench Lee et al
11 A Related Work Recent work has made clear that mobile and computer-use agents face safety risks that differ from those of text-only assistants. Benchmarks such as MobileSafetyBench Lee et al. [2026], GhostEI- Bench Chen et al. [2025], RiOSWorld Yang et al. [2025], and GUIGuard Wang et al. [2026], together with attack studies on environmental injection,...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.