pith. machine review for the scientific record.

arxiv: 2604.09574 · v1 · submitted 2026-02-24 · 💻 cs.AI · cs.LG

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Pith reviewed 2026-05-15 20:32 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords GUI agents · humanization · Turing test · touch dynamics · behavioral divergence · imitability · MinMax optimization · mobile interfaces

The pith

GUI agents can reach high human imitability in mobile touch interactions without losing task performance by minimizing behavioral divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames agent humanization as a MinMax optimization game between a detector trying to spot non-humans and an agent trying to reduce differences in screen interaction behavior. It shows that standard large multimodal model agents produce unnatural movement patterns that make them stand out immediately. Using a new dataset of real human touch dynamics, the authors build the Agent Humanization Benchmark and test methods that add noise or match observed human patterns. These methods raise imitability scores substantially while keeping the same task success rates. The work argues that future agents must succeed not only at completing tasks but at doing so in ways that fit inside human-centric digital systems.

Core claim

By modeling the interaction as a MinMax problem and optimizing agents to minimize behavioral divergence from human touch kinematics on a collected high-fidelity mobile dataset, agents can achieve high imitability scores both theoretically and empirically without any measurable drop in utility or robustness.

What carries the argument

The MinMax optimization between detector and agent that quantifies behavioral divergence, supported by the Agent Humanization Benchmark and associated detection metrics.
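
One way to write the game down, as a sketch only. This page does not reproduce the paper's formula, so every symbol here is an assumption borrowed from the standard adversarial (GAN-style) objective: agent policy $\pi$, detector $D$, human behavior distribution $P_H$, agent behavior distribution $P_\pi$, task utility $U$, and utility floor $\tau$.

```latex
\min_{\pi}\;\max_{D}\;
  \mathbb{E}_{x \sim P_H}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{x \sim P_\pi}\bigl[\log\bigl(1 - D(x)\bigr)\bigr]
\quad \text{s.t.} \quad U(\pi) \ge \tau
```

Under this assumed form, and for an unrestricted detector, the inner maximum equals $2\,\mathrm{JS}(P_H \,\|\, P_\pi) - \log 4$, so the agent's outer minimization amounts to shrinking the Jensen-Shannon divergence between human and agent behavior. The paper's own divergence and constraint may differ.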

If this is right

  • Vanilla LMM-based agents produce detectable unnatural kinematics in touch trajectories.
  • Heuristic noise injection and data-driven behavioral matching both raise imitability without harming task performance.
  • The new benchmark and metrics make the imitability-utility trade-off measurable and comparable across methods.
  • Successful humanization allows agents to operate inside human-centric platforms without triggering adversarial countermeasures.
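
The contrast between robotic and noise-injected swipes in the bullets above can be sketched in a few lines. This is a hypothetical illustration, not the authors' method: the function names, the smoothstep timing curve, the Gaussian lateral jitter, and the `speed_cv` feature are all assumptions chosen for brevity.

```python
import math
import random


def linear_swipe(start, end, steps=20, duration_ms=300.0):
    """Robotic baseline: evenly spaced points at constant speed."""
    (x0, y0), (x1, y1) = start, end
    return [
        (x0 + (x1 - x0) * i / steps, y0 + (y1 - y0) * i / steps, duration_ms * i / steps)
        for i in range(steps + 1)
    ]


def humanized_swipe(start, end, steps=20, duration_ms=300.0, jitter_px=3.0, rng=None):
    """Heuristic variant (assumed): eased progress plus bounded lateral jitter.

    The jitter envelope vanishes at both ends, so the gesture still starts
    and lands on its intended coordinates.
    """
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0) or 1.0
    # Unit normal to the swipe direction; jitter is applied along it.
    nx, ny = -(y1 - y0) / length, (x1 - x0) / length
    points = []
    for i in range(steps + 1):
        u = i / steps                      # uniform time fraction
        s = u * u * (3 - 2 * u)            # smoothstep: slow-fast-slow progress
        envelope = math.sin(math.pi * u)   # ~0 at both endpoints
        offset = rng.gauss(0.0, jitter_px) * envelope
        points.append((x0 + (x1 - x0) * s + nx * offset,
                       y0 + (y1 - y0) * s + ny * offset,
                       duration_ms * u))
    return points


def speed_cv(points):
    """Coefficient of variation of per-segment speed (about 0 for constant speed)."""
    speeds = [
        math.hypot(bx - ax, by - ay) / (bt - at)
        for (ax, ay, at), (bx, by, bt) in zip(points, points[1:])
    ]
    mean = sum(speeds) / len(speeds)
    var = sum((v - mean) ** 2 for v in speeds) / len(speeds)
    return math.sqrt(var) / mean
```

A detector that thresholds a single feature such as `speed_cv` would separate the two generators trivially, which is the kind of kinematic tell the first bullet describes. Real detectors use many more features, so passing this one check says little about deployed undetectability.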

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar humanization techniques could be required for agents on non-mobile interfaces such as web or desktop.
  • Detector designers may need to incorporate higher-order statistics or multi-session patterns once basic kinematic matching becomes common.
  • The MinMax framing suggests a possible arms race where continued improvement in humanization forces detectors to adopt more sophisticated models.

Load-bearing premise

The collected dataset of mobile touch dynamics represents the full range of behaviors that real detectors would rely on, and reducing measured divergence in the model produces actual undetectability in deployed systems.

What would settle it

A controlled test in which a humanized agent is run against production mobile-platform detectors on live apps and still receives non-human flags at rates comparable to vanilla agents.

Figures

Figures reproduced from arXiv: 2604.09574 by Congmin Zheng, Jiachen Zhu, Jianghao Lin, Lingyu Yang, Rong Shan, Weinan Zhang, Weiwen Liu, Yong Yu, Zeyu Zheng.

Figure 1: The adversarial landscape between GUI Agents and Mobile Platforms. The figure illustrates three key stages:
Figure 2: The difference between human and agent swipe.
Figure 3: The visualization of action interval and tap duration differences between human and agents.
Figure 4: Impact of humanization on detection accuracy across feature clusters. The chart compares the detection
Figure 5: Impact of Feature Selection on Detection Accuracy. Comparison of (a) SVM and (b) XGBoost performance as the number of features increases.
Figure 6: Doubao Mobile Assistant Working Scene on the Official Website.
Figure 7: For more details, see the official documentation: Android Sensor Overview and Android Sensor Types
Figure 8: The Lengths and Durations of Each Action. Actions with
Figure 9: The correlation of these 24 features. Red color means stronger correlation.
Figure 10: Impact of Online Humanization on Task Utility. This chart compares the success rates of raw agents (light
Figure 11: Validation of interval mimicry using fake actions. The figure compares the normalized action interval
Figure 12: Distribution analysis of trajectory deviation. We compare the distribution of the

Original abstract

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and conduct our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the 'Turing Test on Screen' benchmark, modeling mobile GUI agent humanization as a MinMax optimization between an agent minimizing behavioral divergence and a detector. It collects a high-fidelity dataset of mobile touch dynamics, shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics, establishes the Agent Humanization Benchmark (AHB) with associated metrics, and proposes methods (heuristic noise injection and data-driven behavioral matching) that achieve high imitability without sacrificing task performance.

Significance. If the empirical claims hold under rigorous validation, the work could meaningfully advance GUI agent research by formalizing the trade-off between utility and undetectability in adversarial environments. The MinMax framing, specialized touch-dynamics dataset, and AHB provide a concrete foundation for future studies on behavioral naturalness, potentially influencing platform policies and agent deployment strategies.

major comments (2)
  1. [Analysis and Proposed Methods] The central claim that vanilla LMM agents are 'easily detectable due to unnatural kinematics' and that proposed methods achieve high imitability without utility loss lacks reported validation metrics, error bars, statistical controls, or ablation studies on the benchmark metrics (as highlighted by the low soundness score). This is load-bearing for the empirical success assertions.
  2. [Dataset Collection and MinMax Formulation] The premise that the collected high-fidelity dataset spans the full distribution of human behavior (and that MinMax divergence minimization on the AHB directly implies evasion against real or adaptive detectors using different features/temporal patterns) is untested. No cross-validation against alternative detectors or online adaptation scenarios is provided.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit quantitative results (e.g., specific imitability scores or detection rates) rather than qualitative statements.
  2. [Formal Modeling] Clarify the precise mathematical definition of the behavioral divergence metric and the trade-off weight in the MinMax objective to improve reproducibility.
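
To make the second minor comment concrete, here is a minimal sketch of one candidate divergence: the Jensen-Shannon divergence between binned distributions of a single feature such as action interval. This is an assumption for illustration. The report notes that the paper's own definition is not precise, and the function names and bin edges below are hypothetical.

```python
import math


def histogram(samples, edges):
    """Normalized histogram over half-open bins [edges[i], edges[i+1]).

    Samples outside the outermost edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for value in samples:
        for i in range(len(counts)):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside the bin edges")
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0 here.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because the measure is symmetric and bounded, scores from different features or methods stay comparable, which is one reason a reviewer might ask for it to be stated explicitly. Any choice of bins or features changes the number, so both would need to be reported alongside it.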

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments below and will revise the paper to incorporate additional validation and clarifications where appropriate.

  1. Referee: [Analysis and Proposed Methods] The central claim that vanilla LMM agents are 'easily detectable due to unnatural kinematics' and that proposed methods achieve high imitability without utility loss lacks reported validation metrics, error bars, statistical controls, or ablation studies on the benchmark metrics (as highlighted by the low soundness score). This is load-bearing for the empirical success assertions.

    Authors: We acknowledge the need for stronger statistical support. In the revised manuscript, we will add error bars to all reported metrics, include statistical significance tests (such as paired t-tests), provide ablation studies isolating the contributions of heuristic noise injection and data-driven matching, and report full benchmark scores with controls for task difficulty. These additions will directly substantiate the claims on detectability and the imitability-utility trade-off.

    Revision: yes

  2. Referee: [Dataset Collection and MinMax Formulation] The premise that the collected high-fidelity dataset spans the full distribution of human behavior (and that MinMax divergence minimization on the AHB directly implies evasion against real or adaptive detectors using different features/temporal patterns) is untested. No cross-validation against alternative detectors or online adaptation scenarios is provided.

    Authors: The dataset was gathered from multiple users performing varied tasks to capture diverse touch dynamics, but we agree that explicit cross-validation and adaptation tests would improve rigor. In revision, we will include experiments evaluating our methods against alternative detector feature sets and discuss limitations for fully adaptive online settings. We will also clarify that the MinMax formulation is a foundational model rather than a complete proof of evasion in all scenarios.

    Revision: partial
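
The paired t-test promised in the first response can be sketched as follows, assuming each task yields one score for the raw agent and one for the humanized agent. The function name is hypothetical and no values from the paper are used.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` computes both in one call. Pairing by task is what lets the test control for task difficulty, since each difference compares the two agents on the same task.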

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper defines the MinMax optimization as an explicit modeling choice for the detector-agent interaction, collects an independent high-fidelity dataset of mobile touch dynamics, performs kinematic analysis on vanilla LMM agents, establishes the AHB benchmark from that analysis, and evaluates proposed methods (heuristic noise and data-driven matching) empirically on the new data. No equation or claim reduces by construction to a fitted parameter from the same dataset, no self-citation bears the central load, and the imitability-utility trade-off is demonstrated rather than assumed via renaming or ansatz smuggling. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract provides limited detail on internal parameters; the MinMax optimization likely involves implicit trade-off weights between imitability and utility that are not specified as fitted or derived.

free parameters (1)
  • trade-off weight in MinMax optimization
    Not detailed in abstract but required to balance detection minimization against task performance.

pith-pipeline@v0.9.0 · 5501 in / 1141 out tokens · 40591 ms · 2026-05-15T20:32:33.562137+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Gpt-4 technical report, 2024

    OpenAI, Josh Achiam, Steven Adler, and Sandhini Agarwal. Gpt-4 technical report, 2024

  2. [2]

    Gemini: A family of highly capable multimodal models, 2025

    Gemini Team, Rohan Anil, and Sebastian Borgeaud. Gemini: A family of highly capable multimodal models, 2025

  3. [3]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023

  4. [4]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InThe Twelfth International Conference on Learning Representations, 2024

  5. [5]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weiezhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile- agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158, 2024

  6. [6]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

  7. [7]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, 2023

  8. [8]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems, volume 35, pages 20744–20757, 2022

  9. [9]

    Superplatforms have to attack ai agents, 2025

    Jianghao Lin, Jiachen Zhu, Zheli Zhou, Yunjia Xi, Weiwen Liu, Yong Yu, and Weinan Zhang. Superplatforms have to attack ai agents, 2025. 12 APREPRINT- APRIL14, 2026

  10. [10]

    What is your ai agent buying? evaluation, biases, model dependence, & emerging implications for agentic e-commerce, 2025

    Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, and Akshit Kumar. What is your ai agent buying? evaluation, biases, model dependence, & emerging implications for agentic e-commerce, 2025

  11. [11]

    How can recommender systems benefit from large language models: A survey, 2024

    Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. How can recommender systems benefit from large language models: A survey, 2024

  12. [12]

    Computing machinery and intelligence.Mind, 59(236):433–460, 1950

    Alan M Turing. Computing machinery and intelligence.Mind, 59(236):433–460, 1950

  13. [13]

    Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportunities.Journal of Network and Computer Applications, 191:103162, 2021

    Ahmad Zairi Zaidi, Chun Yong Chong, Zhe Jin, Rajendran Parthiban, and Ali Safaa Sadiq. Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportunities.Journal of Network and Computer Applications, 191:103162, 2021

  14. [14]

    Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication.IEEE Transactions on Information Forensics and Security, 8(1):136–148, 2013

  15. [15]

    AlQahtani, and Muhammad Khurram Khan

    Reem Alrawili, Ali Abdullah S. AlQahtani, and Muhammad Khurram Khan. Comprehensive survey: Biometric user authentication application, evaluation, and discussion, 2024

  16. [16]

    Princeton University Press, Princeton, NJ, 1944

    John von Neumann and Oskar Morgenstern.Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1944

  17. [17]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014

  18. [18]

    Ui-tars: Pioneering automated gui interaction with native agents, 2025

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

  19. [19]

    Mobile-agent-e: Self-evolving mobile assistant for complex tasks, 2025

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks, 2025

  20. [20]

    Agentcpm-gui: Building mobile-use agents with reinforcement fine-tuning, 2025

    Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Agentcpm-gui: Building mobile-use agents with reinforcement f...

  21. [21]

    Autoglm: Autonomous foundation agents for guis, 2024

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

  22. [22]

    Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication.IEEE transactions on information forensics and security, 8(1):136–148, 2012

  23. [23]

    A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

    Claude E Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

  24. [24]

    Support-vector networks.Machine learning, 20(3):273–297, 1995

    Corinna Cortes and Vladimir Vapnik. Support-vector networks.Machine learning, 20(3):273–297, 1995

  25. [25]

    Xgboost: A scalable tree boosting system.Cornell University, 2016

    Tianqi Chen. Xgboost: A scalable tree boosting system.Cornell University, 2016

  26. [26]

    On calculating with b-splines.Journal of Approximation Theory, 6(1):50–62, 1972

    Carl De Boor. On calculating with b-splines.Journal of Approximation Theory, 6(1):50–62, 1972

  27. [27]

    Mobile-agent-v3: Fundamental agents for gui automation

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation, 2025.URL https://arxiv. org/abs/2508.15144, 4:21–27

  28. [28]

    Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation.arXiv preprint arXiv:2402.11941, 2024

    Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation.arXiv preprint arXiv:2402.11941, 2024

  29. [29]

    Mobileuse: A gui agent with hierarchical reflection for autonomous mobile operation.arXiv preprint arXiv:2507.16853, 2025

    Ning Li, Xiangmou Qu, Jiamu Zhou, Jun Wang, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, and Weinan Zhang. Mobileuse: A gui agent with hierarchical reflection for autonomous mobile operation.arXiv preprint arXiv:2507.16853, 2025

  30. [30]

    Caution for the environment: Multimodal agents are susceptible to environ- mental distractions

    Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, and Hai Zhao. Caution for the environment: Multimodal agents are susceptible to environmental distractions.arXiv preprint arXiv:2408.02544, 2024. 13 APREPRINT- APRIL14, 2026

  31. [31]

    VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

    Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, et al. Verios: Query-driven proactive human-agent-gui interaction for trustworthy os agents.arXiv preprint arXiv:2509.07553, 2025

  32. [32]

    Os-kairos: Adaptive interaction for mllm-powered gui agents

    Pengzhou Cheng, Zheng Wu, Zongru Wu, Tianjie Ju, Aston Zhang, Zhuosheng Zhang, and Gongshen Liu. Os-kairos: Adaptive interaction for mllm-powered gui agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 6701–6725, 2025

  33. [33]

    Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

    Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, et al. Mobile-r1: Towards interactive reinforcement learning for vlm-based mobile agent via task-level rewards.arXiv preprint arXiv:2506.20332, 2025

  34. [34]

    Arpo: End-to-end policy optimization for gui agents with experience replay.arXiv preprint arXiv:2505.16282, 2025

    Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Arpo: End-to-end policy optimization for gui agents with experience replay.arXiv preprint arXiv:2505.16282, 2025

  35. [35]

    Mobilerl: Advancing mobile use agents with adaptive online reinforcement learning, 2025.URL https://github

    Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Advancing mobile use agents with adaptive online reinforcement learning, 2025.URL https://github. com/THUDM/MobileRL

  36. [36]

    The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

  37. [37]

    Dissecting adversarial robustness of multimodal lm agents, 2025

    Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Dissecting adversarial robustness of multimodal lm agents, 2025

  38. [38]

    Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents, 2025

    Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents, 2025

  39. [39]

    Advagent: Controllable blackbox red-teaming on web agents, 2025

    Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advagent: Controllable blackbox red-teaming on web agents, 2025

  40. [40]

    Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast.arXiv preprint arXiv:2402.08567, 2024

    Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast.arXiv preprint arXiv:2402.08567, 2024

  41. [41]

    On the robustness of large multimodal models against image adversarial attacks, 2023

    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks, 2023

  42. [42]

    How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751,

    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751, 2023

  43. [43]

    Eia: Environmental injection attack on generalist web agents for privacy leakage, 2025

    Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage, 2025

  44. [44]

    Evaluating the robustness of multimodal agents against active environmental injection attacks, 2025

    Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, and Shengyu Zhang. Evaluating the robustness of multimodal agents against active environmental injection attacks, 2025

  45. [45]

    The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections, 2025

    Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, and Toby Jia-Jun Li. The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections, 2025

  46. [46]

    Attacking vision-language computer agents via pop-ups, 2025

    Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups, 2025

  47. [47]

    Clip-guided generative networks for transferable targeted adversarial attacks, 2024

    Hao Fang, Jiawei Kong, Bin Chen, Tao Dai, Hao Wu, and Shu-Tao Xia. Clip-guided generative networks for transferable targeted adversarial attacks, 2024

  48. [48]

    Qava: Query-agnostic visual attack to large vision-language models, 2025

    Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, and Yu Wang. Qava: Query-agnostic visual attack to large vision-language models, 2025

  49. [49]

    Exploring the adversarial robustness of clip for ai-generated image detection

    Vincenzo De Rosa, Fabrizio Guillaro, Giovanni Poggi, Davide Cozzolino, and Luisa Verdoliva. Exploring the adversarial robustness of clip for ai-generated image detection. In2024 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2024

  50. [50]

    Badagent: Inserting and activating backdoor attacks in llm agents, 2024

    Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. Badagent: Inserting and activating backdoor attacks in llm agents, 2024

  51. [51]

    Watch out for your agents! investigating backdoor threats to llm-based agents, 2024

    Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigating backdoor threats to llm-based agents, 2024. 14 APREPRINT- APRIL14, 2026

  52. [52]

    Foot-in-the-door: A multi-turn jailbreak for LLMs

    Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939–1950, Suzhou, China, November 2025. Association for Compu...

  53. [53]

    Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A survey.IEEE Access, 5:15226–15257, 2017

    Ahmed Mahfouz, Tarek M Mahmoud, and Ahmed Sharaf Eldin. Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A survey.IEEE Access, 5:15226–15257, 2017

  54. [54]

    In27th USENIX Security Symposium (USENIX Security 18), pages 135–150, 2018

    Antoine Vastel, Pierre Laperdrix, Walter Rudametkin, and Romain Rouvoy.{Fp-Scanner}: The privacy implica- tions of browser fingerprint inconsistencies. In27th USENIX Security Symposium (USENIX Security 18), pages 135–150, 2018

  55. [55]

    Browser fingerprinting: A survey.ACM Transactions on the Web (TWEB), 14(2):1–33, 2020

    Pierre Laperdrix, Nataliia Bielova, Benoit Baudry, and Gildas Avoine. Browser fingerprinting: A survey.ACM Transactions on the Web (TWEB), 14(2):1–33, 2020

  56. [56]

    Continuous mobile authentication using touchscreen gestures.2012 IEEE Conference on Technologies for Homeland Security (HST), pages 451–456, 2012

    Tao Feng, Ziyi Liu, Kyeong-An Kwon, Weidong Larry Shi, Bogdan Carbunar, Jiang Yifei, and Nhung Nguyen. Continuous mobile authentication using touchscreen gestures.2012 IEEE Conference on Technologies for Homeland Security (HST), pages 451–456, 2012

  57. [57]

    Kroeze and Katherine Mary Malan

    Christina J. Kroeze and Katherine Mary Malan. User authentication based on continuous touch biometrics.South Afr. Comput. J., 28, 2016

  58. [58]

    Increauth: Incremental-learning-based behavioral biometric authentication on smartphones.IEEE Internet of Things Journal, 11:1589–1603, 2024

    Zhihao Shen, Shun Li, Xi Zhao, and Jianhua Zou. Increauth: Incremental-learning-based behavioral biometric authentication on smartphones.IEEE Internet of Things Journal, 11:1589–1603, 2024

  59. [59]

    Mouse dynamics behavioral biometrics: A survey.ACM Computing Surveys, 56(6):1–33, 2024

    Simon Khan, Charles Devlen, Michael Manno, and Daqing Hou. Mouse dynamics behavioral biometrics: A survey.ACM Computing Surveys, 56(6):1–33, 2024

  60. [60]

    Game bot detection via avatar trajectory analysis.IEEE Transactions on Computational Intelligence and AI in Games, 2(3):162–175, 2010

    Hsing-Kuo Pao, Kuan-Ta Chen, and Hong-Chung Chang. Game bot detection via avatar trajectory analysis.IEEE Transactions on Computational Intelligence and AI in Games, 2(3):162–175, 2010

  61. [61]

    Forgery-resistant touch-based authentication on mobile devices

    Neil Zhenqiang Gong, Mathias Payer, Reza Moazzezi, and Mario Frank. Forgery-resistant touch-based authentication on mobile devices. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’16, pages 499–510, New York, NY, USA, 2016. ACM

  62. [62]

    Toward robotic robbery on the touch screen. ACM Transactions on Information and System Security (TISSEC), 18(4):1–25, 2016

    Abdul Serwadda, Vir V Phoha, Zibo Wang, Rajesh Kumar, and Diksha Shukla. Toward robotic robbery on the touch screen. ACM Transactions on Information and System Security (TISSEC), 18(4):1–25, 2016

  63. [63]

    Gantouch: An attack-resilient framework for touch-based continuous authentication system. IEEE Transactions on Biometrics, Behavior, and Identity Science, 4(4):533–543, 2022

    Mohit Agrawal, Pragyan Mehrotra, Rajesh Kumar, and Rajiv Ratn Shah. Gantouch: An attack-resilient framework for touch-based continuous authentication system. IEEE Transactions on Biometrics, Behavior, and Identity Science, 4(4):533–543, 2022

  64. [64]

    A survey of ai agent protocols, 2025

    Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, Weiwen Liu, Ying Wen, Yong Yu, and Weinan Zhang. A survey of ai agent protocols, 2025

  65. [65]

    Agentic information retrieval, 2025

    Weinan Zhang, Junwei Liao, Ning Li, Kounianhua Du, and Jianghao Lin. Agentic information retrieval, 2025

  66. [66]

    A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges, 2025

    Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges, 2025

  67. [67]

    Evolutionary perspectives on the evaluation of llm-based ai agents: A comprehensive survey, 2025

    Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng, Bo Chen, Yunjia Xi, Jianghao Lin, Weiwen Liu, Ruiming Tang, Yong Yu, and Weinan Zhang. Evolutionary perspectives on the evaluation of llm-based ai agents: A comprehensive survey, 2025

  68. [68]

    Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012

    Alex Graves. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012

  69. [69]

    Attention is all you need. Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  70. [70]

    Position: The real barrier to llm agent usability is agentic roi, 2026

    Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, and Weinan Zhang. Position: The real barrier to llm agent usability is agentic roi, 2026

    A The Conflict Between GUI Agents and App Platforms

    A.1 Background

    To understand the gravity of this incident, it is ess...

    ... “see” the screen and simulate physical taps. This allows it to execute cross-app workflows without manual input, promising a “zero-touch ...

  71. [71]

    The OS/Agent Provider (ByteDance/Nubia): They argue for User Agency and Innovation. They contend that since the user explicitly authorized the assistant, the AI acts as a legitimate digital proxy for human intent. ByteDance further emphasized that their tool adheres to privacy standards and deliberately avoids sensitive operations like financial transactions

  72. [72]

    The Super-Platform (Tencent/Banks): They cite Security and Ecosystem Integrity. Reports indicate that WeChat’s restrictions were not specifically targeted at Doubao but were unintentional triggers of existing risk control measures. They imply that allowing external programs to drive the apps bypasses critical security checks, creating a vulnerability that...

  73. [73]

    Swipe (x1, y1), (x2, y2)

  74. [74]

    Type (text) / Unable to Type

  75. [75]

    Completed contents

    Stop

    ### Output format ###
    ### Thought ###
    ### Action ###
    ### Operation ###

    F.4 Action Reflection Prompt

    Used after an operation to verify if the result meets the expected thought.

    ### Before the current operation ###
    Screenshot info & Keyboard status...

    ### After the current operation ###
    Screenshot info & Keyboard status...

    ### Current operation ###
    Ins...

  76. [76]

    Observe the current screenshot carefully

  77. [77]

    Consider the previous actions and the progress made so far

  78. [78]

    If the task is completed, use the "stop" action

    Determine the next logical step. If the task is completed, use the "stop" action

  79. [79]

    All coordinates must be normalized to a range of 0 to 1000.

    # Action Space
    - click(x, y): Tap the screen at normalized coordinates (x, y).
    - swipe(x1, y1, x2, y2): Swipe from (x1, y1) to (x2, y2).
    - type(text): Type the specified text into the focused input field.
    - key(name): Press system keys like 'HOME', 'BACK', or 'MENU'.
    - wait(): Wait for the screen...
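
    The prompt above specifies coordinates normalized to a range of 0 to 1000, so an executor has to map them back to device pixels before dispatching a tap or swipe. The following is a minimal sketch of that mapping, not the authors' implementation. The names `to_pixels` and `parse_pointer_action`, the treatment of the range as inclusive, the clamping of 1000 to the last pixel, and the 1080 x 2400 screen size are all assumptions made for illustration.

```python
import re

NORM_MAX = 1000  # the prompt normalizes coordinates to 0..1000 (assumed inclusive)


def to_pixels(x_norm, y_norm, width, height):
    """Map normalized coordinates to integer pixel coordinates (hypothetical helper)."""
    if not (0 <= x_norm <= NORM_MAX and 0 <= y_norm <= NORM_MAX):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    # Clamp so that a normalized value of 1000 lands on the last valid pixel.
    px = min(int(x_norm * width / NORM_MAX), width - 1)
    py = min(int(y_norm * height / NORM_MAX), height - 1)
    return px, py


# Only the two pointer actions are handled here; type, key, and wait are omitted.
_ACTION_RE = re.compile(r"^\s*(click|swipe)\s*\(([^)]*)\)\s*$")


def parse_pointer_action(text, width, height):
    """Parse 'click(x, y)' or 'swipe(x1, y1, x2, y2)' into (name, pixel points)."""
    match = _ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"unsupported action: {text!r}")
    name = match.group(1)
    values = [float(v) for v in match.group(2).split(",")]
    expected = 2 if name == "click" else 4
    if len(values) != expected:
        raise ValueError(f"{name} expects {expected} numbers, got {len(values)}")
    points = [
        to_pixels(values[i], values[i + 1], width, height)
        for i in range(0, expected, 2)
    ]
    return name, points


if __name__ == "__main__":
    # Hypothetical 1080 x 2400 portrait screen.
    print(parse_pointer_action("click(500, 250)", 1080, 2400))
    print(parse_pointer_action("swipe(100, 900, 100, 200)", 1080, 2400))
```

    This mapping yields a single point per tap and two endpoints per swipe. It does not generate the intermediate touch trajectory, which is the part of the interaction the benchmark evaluates for human likeness.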