arxiv: 2604.09574 · v1 · submitted 2026-02-24 · 💻 cs.AI · cs.LG

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Pith reviewed 2026-05-15 20:32 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords GUI agents · humanization · Turing test · touch dynamics · behavioral divergence · imitability · MinMax optimization · mobile interfaces


The pith

GUI agents can reach high human imitability in mobile touch interactions without losing task performance by minimizing behavioral divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames agent humanization as a MinMax optimization game between a detector trying to spot non-humans and an agent trying to reduce differences in screen interaction behavior. It shows that standard large multimodal model agents produce unnatural movement patterns that make them stand out immediately. Using a new dataset of real human touch dynamics, the authors build the Agent Humanization Benchmark and test methods that add noise or match observed human patterns. These methods raise imitability scores substantially while keeping the same task success rates. The work argues that future agents must succeed not only at completing tasks but at doing so in ways that fit inside human-centric digital systems.

Core claim

By modeling the interaction as a MinMax problem and optimizing agents to minimize behavioral divergence from human touch kinematics on a collected high-fidelity mobile dataset, agents can achieve high imitability scores both theoretically and empirically without any measurable drop in utility or robustness.

What carries the argument

The MinMax optimization between detector and agent that quantifies behavioral divergence, supported by the Agent Humanization Benchmark and associated detection metrics.
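This page does not reproduce the paper's objective, so the following is only a generic sketch of what a detector-versus-agent MinMax game commonly looks like, written in the style of the standard GAN objective. All symbols are assumed notation rather than the authors': $\pi$ is the agent's behavior policy, $D$ the detector, $p_H$ the distribution of human touch behavior, $p_\pi$ the distribution induced by the agent, $U(\pi)$ the task utility, and $\lambda \ge 0$ a trade-off weight.

```latex
\min_{\pi} \; \max_{D} \;
  \mathbb{E}_{x \sim p_H}\!\left[ \log D(x) \right]
  + \mathbb{E}_{x \sim p_\pi}\!\left[ \log\bigl(1 - D(x)\bigr) \right]
  - \lambda \, U(\pi)
```

Under this assumed form, the inner maximum over an unrestricted detector equals $2\,\mathrm{JSD}(p_H \,\|\, p_\pi) - \log 4$, so the agent's side of the game reduces to shrinking a divergence between human and agent behavior while the $\lambda$ term rewards keeping task utility high. Whether the paper uses this divergence or another one is not stated here.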

If this is right

Vanilla LMM-based agents produce detectable unnatural kinematics in touch trajectories.
Heuristic noise injection and data-driven behavioral matching both raise imitability without harming task performance.
The new benchmark and metrics make the imitability-utility trade-off measurable and comparable across methods.
Successful humanization allows agents to operate inside human-centric platforms without triggering adversarial countermeasures.
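To make "heuristic noise injection" concrete, here is a minimal sketch of turning a straight, constant-speed swipe into one with a slow-fast-slow speed profile, a slight bow, and small positional jitter. The function name `humanize_swipe`, its parameters, and every numeric default are hypothetical illustrations; the paper's actual heuristics are not described on this page.

```python
import math
import random


def humanize_swipe(start, end, n_points=20, duration_ms=300.0,
                   jitter_px=2.0, seed=None):
    """Return a list of (x, y, t_ms) samples from start to end.

    Assumptions (illustrative only, not taken from the paper):
    a smoothstep ease-in-out speed profile, a small sideways bow,
    and Gaussian positional jitter on interior points.
    Requires n_points >= 2.
    """
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy) or 1.0
    # Unit normal to the swipe direction, used to bend the path.
    nx, ny = -dy / length, dx / length
    bow = rng.uniform(-0.05, 0.05) * length
    points = []
    for i in range(n_points):
        u = i / (n_points - 1)            # uniform time fraction in [0, 1]
        s = u * u * (3.0 - 2.0 * u)       # smoothstep: slow-fast-slow progress
        offset = bow * math.sin(math.pi * s)  # vanishes at both endpoints
        interior = 0 < i < n_points - 1
        jx = rng.gauss(0.0, jitter_px) if interior else 0.0
        jy = rng.gauss(0.0, jitter_px) if interior else 0.0
        points.append((x0 + dx * s + nx * offset + jx,
                       y0 + dy * s + ny * offset + jy,
                       u * duration_ms))
    return points


# Upward scroll gesture on a hypothetical 1080x1920 screen.
for x, y, t in humanize_swipe((540, 1500), (540, 600), n_points=5, seed=1):
    print(f"{t:6.1f} ms  ({x:7.1f}, {y:7.1f})")
```

The endpoints are left untouched so the gesture still lands where the agent intended, which is one simple way to keep task success unchanged while altering only the kinematics.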

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar humanization techniques could be required for agents on non-mobile interfaces such as web or desktop.
Detector designers may need to incorporate higher-order statistics or multi-session patterns once basic kinematic matching becomes common.
The MinMax framing suggests a possible arms race where continued improvement in humanization forces detectors to adopt more sophisticated models.

Load-bearing premise

The collected dataset of mobile touch dynamics represents the full range of behaviors that real detectors would rely on, and reducing measured divergence in the model produces actual undetectability in deployed systems.

What would settle it

A controlled test in which a humanized agent is run against production mobile-platform detectors on live apps and still receives non-human flags at rates comparable to vanilla agents.

Figures

Figures reproduced from arXiv: 2604.09574 by Congmin Zheng, Jiachen Zhu, Jianghao Lin, Lingyu Yang, Rong Shan, Weinan Zhang, Weiwen Liu, Yong Yu, Zeyu Zheng.

**Figure 1.** The adversarial landscape between GUI Agents and Mobile Platforms. The figure illustrates three key stages:

**Figure 2.** The difference between human and agent swipe.

**Figure 3.** The visualization of action interval and tap duration differences between human and agents.

**Figure 4.** Impact of humanization on detection accuracy across feature clusters. The chart compares the detection

**Figure 5.** Impact of Feature Selection on Detection Accuracy. Comparison of (a) SVM and (b) XGBoost performance as the number of features increases.

**Figure 6.** Doubao Mobile Assistant Working Scene on the Official Website.

**Figure 7.** For more details, see the official documentation: Android Sensor Overview and Android Sensor Types

**Figure 8.** The Lengths and Durations of Each Action. Actions with

**Figure 9.** The correlation of these 24 features. Red color means stronger correlation.

**Figure 10.** Impact of Online Humanization on Task Utility. This chart compares the success rates of raw agents (light

**Figure 11.** Validation of interval mimicry using fake actions. The figure compares the normalized action interval

**Figure 12.** Distribution analysis of trajectory deviation. We compare the distribution of the

Original abstract

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and conduct our analysis that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New MinMax benchmark and touch dataset for GUI agent humanization, but the empirical case for real-world undetectability is still thin.


The core contribution here is a MinMax framing of GUI agents versus detectors, plus a new high-fidelity mobile touch-dynamics dataset and the AHB benchmark to measure how well agents can imitate human kinematics without losing task performance. They show that off-the-shelf LMM agents are easy to spot from unnatural touch patterns and then test simple fixes like heuristic noise and data-driven matching that reportedly close the gap on their metrics while keeping utility intact. That setup and the dataset are genuinely new and not just re-labeled prior work; the practical focus on anti-detection for deployed agents is also useful given how platforms are pushing back against automation.

Referee Report

2 major / 2 minor

Summary. The paper introduces the 'Turing Test on Screen' benchmark, modeling mobile GUI agent humanization as a MinMax optimization between an agent minimizing behavioral divergence and a detector. It collects a high-fidelity dataset of mobile touch dynamics, shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics, establishes the Agent Humanization Benchmark (AHB) with associated metrics, and proposes methods (heuristic noise injection and data-driven behavioral matching) that achieve high imitability without sacrificing task performance.

Significance. If the empirical claims hold under rigorous validation, the work could meaningfully advance GUI agent research by formalizing the trade-off between utility and undetectability in adversarial environments. The MinMax framing, specialized touch-dynamics dataset, and AHB provide a concrete foundation for future studies on behavioral naturalness, potentially influencing platform policies and agent deployment strategies.

major comments (2)

[Analysis and Proposed Methods] The central claim that vanilla LMM agents are 'easily detectable due to unnatural kinematics' and that proposed methods achieve high imitability without utility loss lacks reported validation metrics, error bars, statistical controls, or ablation studies on the benchmark metrics (as highlighted by the low soundness score). This is load-bearing for the empirical success assertions.
[Dataset Collection and MinMax Formulation] The premise that the collected high-fidelity dataset spans the full distribution of human behavior (and that MinMax divergence minimization on the AHB directly implies evasion against real or adaptive detectors using different features/temporal patterns) is untested. No cross-validation against alternative detectors or online adaptation scenarios is provided.

minor comments (2)

[Abstract] The abstract and introduction would benefit from explicit quantitative results (e.g., specific imitability scores or detection rates) rather than qualitative statements.
[Formal Modeling] Clarify the precise mathematical definition of the behavioral divergence metric and the trade-off weight in the MinMax objective to improve reproducibility.
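The last minor comment asks for a precise definition of the behavioral divergence metric, which this page does not give. As one hypothetical stand-in, the sketch below measures the Jensen-Shannon divergence between binned tap-duration distributions for humans and an agent. The choice of JS divergence, the bin edges, the add-one smoothing, and the sample values are all assumptions made for illustration, not the paper's metric or data.

```python
import math


def histogram(samples, edges):
    """Normalized histogram with add-one smoothing so no bin is empty.

    Samples outside [edges[0], edges[-1]] are ignored.
    """
    n_bins = len(edges) - 1
    counts = [1.0] * n_bins
    for v in samples:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1.0
                break
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical inputs, at most 1."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up tap durations in milliseconds, for illustration only.
edges = [0, 40, 80, 120, 160, 200]
human_taps = [70, 85, 95, 110, 120, 140, 90, 105]
agent_taps = [50] * 8  # a scripted agent issuing constant-duration taps

p_human = histogram(human_taps, edges)
p_agent = histogram(agent_taps, edges)
print("JS divergence (bits):", js_divergence(p_human, p_agent))
```

A symmetric, bounded measure like this is convenient for comparing methods on a common scale, but a single-feature histogram says nothing about joint or temporal structure, which is exactly the gap the second major comment points at.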

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments below and will revise the paper to incorporate additional validation and clarifications where appropriate.


Referee: [Analysis and Proposed Methods] The central claim that vanilla LMM agents are 'easily detectable due to unnatural kinematics' and that proposed methods achieve high imitability without utility loss lacks reported validation metrics, error bars, statistical controls, or ablation studies on the benchmark metrics (as highlighted by the low soundness score). This is load-bearing for the empirical success assertions.

Authors: We acknowledge the need for stronger statistical support. In the revised manuscript, we will add error bars to all reported metrics, include statistical significance tests (such as paired t-tests), provide ablation studies isolating the contributions of heuristic noise injection and data-driven matching, and report full benchmark scores with controls for task difficulty. These additions will directly substantiate the claims on detectability and the imitability-utility trade-off.
Revision: yes
Referee: [Dataset Collection and MinMax Formulation] The premise that the collected high-fidelity dataset spans the full distribution of human behavior (and that MinMax divergence minimization on the AHB directly implies evasion against real or adaptive detectors using different features/temporal patterns) is untested. No cross-validation against alternative detectors or online adaptation scenarios is provided.

Authors: The dataset was gathered from multiple users performing varied tasks to capture diverse touch dynamics, but we agree that explicit cross-validation and adaptation tests would improve rigor. In revision, we will include experiments evaluating our methods against alternative detector feature sets and discuss limitations for fully adaptive online settings. We will also clarify that the MinMax formulation is a foundational model rather than a complete proof of evasion in all scenarios.
Revision: partial
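The first response promises paired t-tests. As a rough sketch of what that check could look like for the utility claim, the code below computes a paired t statistic on per-task success rates measured with and without humanization. The function name `paired_t` and the numbers are hypothetical; no results from the paper are used.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(after, before)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Made-up per-task success rates: raw agent vs. humanized agent.
raw = [0.82, 0.75, 0.91, 0.68, 0.79]
humanized = [0.80, 0.76, 0.90, 0.69, 0.78]

t_stat, dof = paired_t(raw, humanized)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution's CDF, for example via `scipy.stats.ttest_rel`, which performs the whole test in one call. Note that failing to reject the null is weak evidence for "no utility loss"; an equivalence test with a stated margin would support that claim more directly.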

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper defines the MinMax optimization as an explicit modeling choice for the detector-agent interaction, collects an independent high-fidelity dataset of mobile touch dynamics, performs kinematic analysis on vanilla LMM agents, establishes the AHB benchmark from that analysis, and evaluates proposed methods (heuristic noise and data-driven matching) empirically on the new data. No equation or claim reduces by construction to a fitted parameter from the same dataset, no self-citation bears the central load, and the imitability-utility trade-off is demonstrated rather than assumed via renaming or ansatz smuggling. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Abstract provides limited detail on internal parameters; the MinMax optimization likely involves implicit trade-off weights between imitability and utility that are not specified as fitted or derived.

free parameters (1)

trade-off weight in MinMax optimization
Not detailed in abstract but required to balance detection minimization against task performance.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Gpt-4 technical report, 2024

OpenAI, Josh Achiam, Steven Adler, and Sandhini Agarwal. Gpt-4 technical report, 2024

work page 2024
[2]

Gemini: A family of highly capable multimodal models, 2025

Gemini Team, Rohan Anil, and Sebastian Borgeaud. Gemini: A family of highly capable multimodal models, 2025

work page 2025
[3]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[4]

Appagent: Multimodal agents as smartphone users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[5]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weiezhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile- agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158, 2024

work page internal anchor Pith review arXiv 2024
[6]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

work page 2024
[7]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[8]

Webshop: Towards scalable real-world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems, volume 35, pages 20744–20757, 2022

work page 2022
[9]

Superplatforms have to attack ai agents, 2025

Jianghao Lin, Jiachen Zhu, Zheli Zhou, Yunjia Xi, Weiwen Liu, Yong Yu, and Weinan Zhang. Superplatforms have to attack ai agents, 2025. 12 APREPRINT- APRIL14, 2026

work page 2025
[10]

What is your ai agent buying? evaluation, biases, model dependence, & emerging implications for agentic e-commerce, 2025

Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, and Akshit Kumar. What is your ai agent buying? evaluation, biases, model dependence, & emerging implications for agentic e-commerce, 2025

work page 2025
[11]

How can recommender systems benefit from large language models: A survey, 2024

Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. How can recommender systems benefit from large language models: A survey, 2024

work page 2024
[12]

Computing machinery and intelligence.Mind, 59(236):433–460, 1950

Alan M Turing. Computing machinery and intelligence.Mind, 59(236):433–460, 1950

work page 1950
[13]

Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportunities.Journal of Network and Computer Applications, 191:103162, 2021

Ahmad Zairi Zaidi, Chun Yong Chong, Zhe Jin, Rajendran Parthiban, and Ali Safaa Sadiq. Touch-based continuous mobile device authentication: State-of-the-art, challenges and opportunities.Journal of Network and Computer Applications, 191:103162, 2021

work page 2021
[14]

Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication.IEEE Transactions on Information Forensics and Security, 8(1):136–148, 2013

work page 2013
[15]

AlQahtani, and Muhammad Khurram Khan

Reem Alrawili, Ali Abdullah S. AlQahtani, and Muhammad Khurram Khan. Comprehensive survey: Biometric user authentication application, evaluation, and discussion, 2024

work page 2024
[16]

Princeton University Press, Princeton, NJ, 1944

John von Neumann and Oskar Morgenstern.Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1944

work page 1944
[17]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014

work page 2014
[18]

Ui-tars: Pioneering automated gui interaction with native agents, 2025

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

work page 2025
[19]

Mobile-agent-e: Self-evolving mobile assistant for complex tasks, 2025

Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks, 2025

work page 2025
[20]

Agentcpm-gui: Building mobile-use agents with reinforcement fine-tuning, 2025

Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Agentcpm-gui: Building mobile-use agents with reinforcement f...

work page 2025
[21]

Autoglm: Autonomous foundation agents for guis, 2024

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

work page 2024
[22]

Mario Frank, Ralf Biedert, Eugene Ma, Ivan Martinovic, and Dawn Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication.IEEE transactions on information forensics and security, 8(1):136–148, 2012

work page 2012
[23]

A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

Claude E Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

work page 1948
[24]

Support-vector networks.Machine learning, 20(3):273–297, 1995

Corinna Cortes and Vladimir Vapnik. Support-vector networks.Machine learning, 20(3):273–297, 1995

work page 1995
[25]

Xgboost: A scalable tree boosting system.Cornell University, 2016

Tianqi Chen. Xgboost: A scalable tree boosting system.Cornell University, 2016

work page 2016
[26]

On calculating with b-splines.Journal of Approximation Theory, 6(1):50–62, 1972

Carl De Boor. On calculating with b-splines.Journal of Approximation Theory, 6(1):50–62, 1972

work page 1972
[27]

Mobile-agent-v3: Fundamental agents for gui automation

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation, 2025.URL https://arxiv. org/abs/2508.15144, 4:21–27

work page arXiv 2025
[28]

Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation.arXiv preprint arXiv:2402.11941, 2024

Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation.arXiv preprint arXiv:2402.11941, 2024

work page arXiv 2024
[29]

Mobileuse: A gui agent with hierarchical reflection for autonomous mobile operation.arXiv preprint arXiv:2507.16853, 2025

Ning Li, Xiangmou Qu, Jiamu Zhou, Jun Wang, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, and Weinan Zhang. Mobileuse: A gui agent with hierarchical reflection for autonomous mobile operation.arXiv preprint arXiv:2507.16853, 2025

work page arXiv 2025
[30]

Caution for the environment: Multimodal agents are susceptible to environ- mental distractions

Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, and Hai Zhao. Caution for the environment: Multimodal agents are susceptible to environmental distractions.arXiv preprint arXiv:2408.02544, 2024. 13 APREPRINT- APRIL14, 2026

work page arXiv 2024
[31]

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, et al. Verios: Query-driven proactive human-agent-gui interaction for trustworthy os agents.arXiv preprint arXiv:2509.07553, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Os-kairos: Adaptive interaction for mllm-powered gui agents

Pengzhou Cheng, Zheng Wu, Zongru Wu, Tianjie Ju, Aston Zhang, Zhuosheng Zhang, and Gongshen Liu. Os-kairos: Adaptive interaction for mllm-powered gui agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 6701–6725, 2025

work page 2025
[33]

Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, et al. Mobile-r1: Towards interactive reinforcement learning for vlm-based mobile agent via task-level rewards.arXiv preprint arXiv:2506.20332, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Arpo: End-to-end policy optimization for gui agents with experience replay.arXiv preprint arXiv:2505.16282, 2025

Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Arpo: End-to-end policy optimization for gui agents with experience replay.arXiv preprint arXiv:2505.16282, 2025

work page arXiv 2025
[35]

Mobilerl: Advancing mobile use agents with adaptive online reinforcement learning, 2025.URL https://github

Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Advancing mobile use agents with adaptive online reinforcement learning, 2025.URL https://github. com/THUDM/MobileRL

work page 2025
[36]

The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

work page 2025
[37]

Dissecting adversarial robustness of multimodal lm agents, 2025

Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Dissecting adversarial robustness of multimodal lm agents, 2025

work page 2025
[38]

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents, 2025

Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents, 2025

work page 2025
[39]

Advagent: Controllable blackbox red-teaming on web agents, 2025

Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. Advagent: Controllable blackbox red-teaming on web agents, 2025

work page 2025
[40]

Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast.arXiv preprint arXiv:2402.08567, 2024

Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, and Min Lin. Agent smith: A single image can jailbreak one million multimodal llm agents exponentially fast.arXiv preprint arXiv:2402.08567, 2024

work page arXiv 2024
[41]

On the robustness of large multimodal models against image adversarial attacks, 2023

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks, 2023

work page 2023
[42]

How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751,

Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is google’s bard to adversarial image attacks?arXiv preprint arXiv:2309.11751, 2023

work page arXiv 2023
[43]

Eia: Environmental injection attack on generalist web agents for privacy leakage, 2025

Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, and Huan Sun. Eia: Environmental injection attack on generalist web agents for privacy leakage, 2025

work page 2025
[44]

Evaluating the robustness of multimodal agents against active environmental injection attacks, 2025

Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, and Shengyu Zhang. Evaluating the robustness of multimodal agents against active environmental injection attacks, 2025

work page 2025
[45]

The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections, 2025

Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, and Toby Jia-Jun Li. The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections, 2025

work page 2025
[46]

Attacking vision-language computer agents via pop-ups, 2025

Yanzhe Zhang, Tao Yu, and Diyi Yang. Attacking vision-language computer agents via pop-ups, 2025

work page 2025
[47]

Clip-guided generative networks for transferable targeted adversarial attacks, 2024

Hao Fang, Jiawei Kong, Bin Chen, Tao Dai, Hao Wu, and Shu-Tao Xia. Clip-guided generative networks for transferable targeted adversarial attacks, 2024

work page 2024
[48]

Qava: Query-agnostic visual attack to large vision-language models, 2025

Yudong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, and Yu Wang. Qava: Query-agnostic visual attack to large vision-language models, 2025

work page 2025
[49]

Exploring the adversarial robustness of clip for ai-generated image detection

Vincenzo De Rosa, Fabrizio Guillaro, Giovanni Poggi, Davide Cozzolino, and Luisa Verdoliva. Exploring the adversarial robustness of clip for ai-generated image detection. In2024 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2024

work page 2024
[50]

Badagent: Inserting and activating backdoor attacks in llm agents, 2024

Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. Badagent: Inserting and activating backdoor attacks in llm agents, 2024

work page 2024
[51]

Watch out for your agents! investigating backdoor threats to llm-based agents, 2024

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigating backdoor threats to llm-based agents, 2024. 14 APREPRINT- APRIL14, 2026

work page 2024
[52]

Foot-in-the-door: A multi-turn jailbreak for LLMs

Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for LLMs. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1939–1950, Suzhou, China, November 2025. Association for Compu...

work page 2025
[53]

Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A survey.IEEE Access, 5:15226–15257, 2017

Ahmed Mahfouz, Tarek M Mahmoud, and Ahmed Sharaf Eldin. Sensor-based continuous authentication of smartphones’ users using behavioral biometrics: A survey.IEEE Access, 5:15226–15257, 2017

work page 2017
[54]

In27th USENIX Security Symposium (USENIX Security 18), pages 135–150, 2018

Antoine Vastel, Pierre Laperdrix, Walter Rudametkin, and Romain Rouvoy.{Fp-Scanner}: The privacy implica- tions of browser fingerprint inconsistencies. In27th USENIX Security Symposium (USENIX Security 18), pages 135–150, 2018

work page 2018
[55]

Browser fingerprinting: A survey.ACM Transactions on the Web (TWEB), 14(2):1–33, 2020

Pierre Laperdrix, Nataliia Bielova, Benoit Baudry, and Gildas Avoine. Browser fingerprinting: A survey.ACM Transactions on the Web (TWEB), 14(2):1–33, 2020

work page 2020
[56]

Continuous mobile authentication using touchscreen gestures.2012 IEEE Conference on Technologies for Homeland Security (HST), pages 451–456, 2012

Tao Feng, Ziyi Liu, Kyeong-An Kwon, Weidong Larry Shi, Bogdan Carbunar, Jiang Yifei, and Nhung Nguyen. Continuous mobile authentication using touchscreen gestures.2012 IEEE Conference on Technologies for Homeland Security (HST), pages 451–456, 2012

work page 2012
[57]

Kroeze and Katherine Mary Malan

Christina J. Kroeze and Katherine Mary Malan. User authentication based on continuous touch biometrics.South Afr. Comput. J., 28, 2016

[58] Zhihao Shen, Shun Li, Xi Zhao, and Jianhua Zou. Increauth: Incremental-learning-based behavioral biometric authentication on smartphones. IEEE Internet of Things Journal, 11:1589–1603, 2024.

[59] Simon Khan, Charles Devlen, Michael Manno, and Daqing Hou. Mouse dynamics behavioral biometrics: A survey. ACM Computing Surveys, 56(6):1–33, 2024.

[60] Hsing-Kuo Pao, Kuan-Ta Chen, and Hong-Chung Chang. Game bot detection via avatar trajectory analysis. IEEE Transactions on Computational Intelligence and AI in Games, 2(3):162–175, 2010.

[61] Neil Zhenqiang Gong, Mathias Payer, Reza Moazzezi, and Mario Frank. Forgery-resistant touch-based authentication on mobile devices. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’16, pages 499–510, New York, NY, USA, 2016. ACM.

[62] Abdul Serwadda, Vir V Phoha, Zibo Wang, Rajesh Kumar, and Diksha Shukla. Toward robotic robbery on the touch screen. ACM Transactions on Information and System Security (TISSEC), 18(4):1–25, 2016.

[63] Mohit Agrawal, Pragyan Mehrotra, Rajesh Kumar, and Rajiv Ratn Shah. Gantouch: An attack-resilient framework for touch-based continuous authentication system. IEEE Transactions on Biometrics, Behavior, and Identity Science, 4(4):533–543, 2022.

[64] Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, Weiwen Liu, Ying Wen, Yong Yu, and Weinan Zhang. A survey of AI agent protocols, 2025.

[65] Weinan Zhang, Junwei Liao, Ning Li, Kounianhua Du, and Jianghao Lin. Agentic information retrieval, 2025.

[66] Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. A survey of LLM-based deep search agents: Paradigm, optimization, evaluation, and challenges, 2025.

[67] Jiachen Zhu, Menghui Zhu, Renting Rui, Rong Shan, Congmin Zheng, Bo Chen, Yunjia Xi, Jianghao Lin, Weiwen Liu, Ruiming Tang, Yong Yu, and Weinan Zhang. Evolutionary perspectives on the evaluation of LLM-based AI agents: A comprehensive survey, 2025.

[68] Alex Graves. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012.

[69] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[70] Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, and Weinan Zhang. Position: The real barrier to LLM agent usability is agentic ROI, 2026.

A The Conflict Between GUI Agents and App Platforms

A.1 Background

To understand the gravity of this incident, it is ess... “see” the screen and simulate physical taps. This allows it to execute cross-app workflows without manual input, promising a “zero-touch...

The OS/Agent Provider (ByteDance/Nubia): They argue for User Agency and Innovation. They contend that since the user explicitly authorized the assistant, the AI acts as a legitimate digital proxy for human intent. ByteDance further emphasized that their tool adheres to privacy standards and deliberately avoids sensitive operations like financial transactions.

The Super-Platform (Tencent/Banks): They cite Security and Ecosystem Integrity. Reports indicate that WeChat’s restrictions were not specifically targeted at Doubao but were unintentional triggers of existing risk control measures. They imply that allowing external programs to drive the apps bypasses critical security checks, creating a vulnerability that...

Swipe (x1, y1), (x2, y2)

Type (text) / Unable to Type

Completed contents

Stop

### Output format ###
### Thought ###
### Action ###
### Operation ###

F.4 Action Reflection Prompt

Used after an operation to verify if the result meets the expected thought.

### Before the current operation ###
Screenshot info & Keyboard status...

### After the current operation ###
Screenshot info & Keyboard status...

### Current operation ###
Ins...

Observe the current screenshot carefully.

Consider the previous actions and the progress made so far.

Determine the next logical step. If the task is completed, use the "stop" action.

All coordinates must be normalized to a range of 0 to 1000.

# Action Space
- click(x, y): Tap the screen at normalized coordinates (x, y).
- swipe(x1, y1, x2, y2): Swipe from (x1, y1) to (x2, y2).
- type(text): Type the specified text into the focused input field.
- key(name): Press system keys like 'HOME', 'BACK', or 'MENU'.
- wait(): Wait for the screen...
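The action space above can be made concrete with a small parsing sketch. This is an illustration only, not the authors' implementation: the names `Action`, `to_pixels`, and `parse_action`, the 1080×2400 screen size, and the rounding and clamping rule are all assumptions. The prompt itself states only the action names and the 0 to 1000 coordinate range.

```python
import re
from dataclasses import dataclass

NORM_MAX = 1000  # the prompt normalizes coordinates to the range 0..1000


@dataclass
class Action:
    """Hypothetical container for one parsed action."""
    name: str
    args: tuple


def to_pixels(x_norm, y_norm, width, height):
    """Map normalized coordinates to pixels (assumed rule: scale, round, clamp)."""
    if not (0 <= x_norm <= NORM_MAX and 0 <= y_norm <= NORM_MAX):
        raise ValueError("normalized coordinates must lie in [0, 1000]")
    # Clamp so that a normalized value of 1000 maps to the last valid pixel.
    px = min(width - 1, round(x_norm * width / NORM_MAX))
    py = min(height - 1, round(y_norm * height / NORM_MAX))
    return px, py


# Matches the five action names listed in the prompt's visible action space.
_CALL = re.compile(r"^\s*(click|swipe|type|key|wait)\((.*)\)\s*$", re.DOTALL)


def parse_action(text, width, height):
    """Parse one action string and convert any coordinates to pixels."""
    match = _CALL.match(text)
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    name, raw = match.group(1), match.group(2).strip()
    if name in ("click", "swipe"):
        nums = [float(v) for v in raw.split(",")]
        expected = 2 if name == "click" else 4
        if len(nums) != expected:
            raise ValueError(f"{name} expects {expected} coordinates")
        pixels = []
        for i in range(0, expected, 2):
            pixels.extend(to_pixels(nums[i], nums[i + 1], width, height))
        return Action(name, tuple(pixels))
    if name == "wait":
        return Action(name, ())
    # type(text) and key(name): drop the surrounding quotes, keep the payload.
    return Action(name, (raw.strip("'\""),))


if __name__ == "__main__":
    # Hypothetical 1080x2400 portrait screen.
    print(parse_action("click(500, 250)", 1080, 2400))
    print(parse_action("swipe(0, 0, 1000, 1000)", 1080, 2400))
```

Keeping the model output in normalized coordinates and converting only at execution time makes the same prompt usable across devices with different resolutions; the conversion is the single place where the screen size enters.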
