LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

arxiv: 2507.10610 · v3 · submitted 2025-07-13 · 💻 cs.CR · cs.AI

Zihe Yan, Jiaping Gui, Zhuosheng Zhang, Gongshen Liu

Pith reviewed 2026-05-19 04:05 UTC · model grok-4.3

keywords GUI agents · pop-up attacks · attention alignment · defense mechanism · multimodal models · layer-wise scaling · adversarial robustness

The pith

A layer-wise scaling mechanism defends GUI agents from pop-up attacks by restoring attention alignment without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that pop-up attacks on GUI agents succeed because they cause attention to diverge across layers in multimodal models, pulling saliency away from task-relevant screen regions. It identifies a stable layer-wise pattern that separates correct from incorrect outputs under these attacks. The proposed LaSM approach selects critical layers and scales up their attention and MLP modules to correct the misalignment. This change requires no extra training and leaves the model's ordinary task performance intact. If the claim holds, it means a lightweight post-processing step can make interactive agents safer against visual injections that would otherwise trigger unsafe actions.

Core claim

GUI agents show a consistent layer-wise attention divergence pattern when pop-up attacks produce incorrect outputs, with saliency shifting away from task areas in identifiable layers. Selectively amplifying the attention and MLP modules in those critical layers realigns model saliency with relevant regions, raising defense success rates while preserving general capabilities.
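The divergence pattern described above can be illustrated with a per-layer cosine similarity between attention maps from two runs. This is a minimal sketch, not the authors' code: it assumes each layer's attention over image patches has already been flattened into a list of floats, and the function names and toy values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_similarity(attn_a, attn_b):
    """Per-layer cosine similarity between two runs' flattened attention maps."""
    if len(attn_a) != len(attn_b):
        raise ValueError("both runs must have the same number of layers")
    return [cosine_similarity(a, b) for a, b in zip(attn_a, attn_b)]


# Toy data (hypothetical): 3 layers, 4 image patches per layer.
correct_run = [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1], [0.8, 0.1, 0.05, 0.05]]
attacked_run = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.2, 0.6], [0.05, 0.05, 0.1, 0.8]]

sims = layerwise_similarity(correct_run, attacked_run)
# Index of the layer where the two runs disagree most (lowest similarity).
most_divergent = min(range(len(sims)), key=sims.__getitem__)
```

In this toy example the two runs agree in the first layer and drift apart afterwards, which is the kind of signal a layer-selection step would look for.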

What carries the argument

LaSM, the Layer-wise Scaling Mechanism, which detects layers with attention divergence and amplifies their attention and MLP components to restore task-relevant focus.
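A minimal sketch of the scaling step, under stated assumptions: the model is represented as a plain list of layers, each holding one attention weight matrix and one MLP weight matrix as nested lists, and a single factor `alpha` is applied to both. The keys `"attn"` and `"mlp"`, the function names, and the value of `alpha` are hypothetical; which weight matrices the released code actually scales is not stated on this page.

```python
import copy


def scale_matrix(matrix, alpha):
    """Multiply every entry of a nested-list matrix by alpha."""
    return [[alpha * w for w in row] for row in matrix]


def apply_layer_scaling(layers, first, last, alpha):
    """Return a copy of `layers` with attention and MLP weights scaled
    in the 1-indexed, inclusive layer range [first, last]."""
    if not (1 <= first <= last <= len(layers)):
        raise ValueError("layer range out of bounds")
    scaled = copy.deepcopy(layers)  # leave the original weights untouched
    for idx in range(first - 1, last):
        for name in ("attn", "mlp"):
            scaled[idx][name] = scale_matrix(scaled[idx][name], alpha)
    return scaled


# Toy 4-layer model with 2x2 weight matrices (hypothetical values).
toy_model = [
    {"attn": [[1.0, 0.0], [0.0, 1.0]], "mlp": [[0.5, 0.5], [0.5, 0.5]]}
    for _ in range(4)
]
defended = apply_layer_scaling(toy_model, first=2, last=3, alpha=1.5)
```

Because the function returns a copy, the unscaled model stays available for ordinary tasks, which mirrors the claim that no retraining is involved.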

If this is right

Defense success rate rises substantially on multiple datasets under pop-up attacks.
The approach remains robust even when attacks include inductive interference.
General capabilities on standard tasks show negligible change.
Attention misalignment is identified as a central vulnerability that selective modulation can address.
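For concreteness, defense success rate (DSR) can be read as the fraction of attacked samples on which the agent's action counts as a successful defense. The sketch below assumes each sample has already been judged as a boolean; the judging rule and the toy outcome lists are hypothetical and are not results from the paper.

```python
def defense_success_rate(outcomes):
    """Fraction of attacked samples judged as successfully defended."""
    if not outcomes:
        raise ValueError("need at least one attacked sample")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Made-up outcomes for five attacked samples, for illustration only.
baseline = [False, False, True, False, True]
with_scaling = [True, True, True, False, True]

dsr_gain = defense_success_rate(with_scaling) - defense_success_rate(baseline)
```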

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar layer-wise adjustments might counter other visual manipulations that divert attention in multimodal agents.
Agent deployments could adopt this scaling step as a standard safety filter at inference time.
The divergence pattern could be monitored in real time to trigger scaling only when attacks are suspected.

Load-bearing premise

The layer-wise attention divergence pattern between correct and incorrect outputs stays stable and identifiable across models, tasks, and attack instances so that critical layers can be chosen reliably for scaling.

What would settle it

Running the method on a new set of GUI agents or pop-up attack variants yields no consistent divergence pattern that allows layer selection, or the scaling step produces no measurable rise in defense success rate.

Figures

Figures reproduced from arXiv: 2507.10610 by Gongshen Liu, Jiaping Gui, Zhuosheng Zhang, Zihe Yan.

**Figure 2.** Each subfigure shows attention heatmaps (left) and layerwise cosine similarities (right)

**Figure 3.** Illustration of direct scaling applied to layers (highlighted in …

**Figure 4.** DSR comparison under different layer scaling strategies.

**Figure 5.** Illustration of progressive layer range narrowing, where the final narrowed range is marked …

**Figure 6.** Attention response under different layer scaling strategies. Figure (a) illustrates the layer …

**Figure 7.** Angular difference between hidden states of …

**Figure 8.** Examples of 12 pop-up variations grouped by size, each combining semantic relevance and …

**Figure 9.** An example episode illustrating a complete interaction sequence with an injected popup.

Original abstract

Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation. Our code can be found at https://github.com/YANGTUOMAO/LaSM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LaSM, a training-free Layer-wise Scaling Mechanism for defending GUI agents (built on MLLMs) against pop-up environmental injection attacks. It identifies a layer-wise attention divergence pattern between correct and attacked outputs, then selectively scales attention and MLP modules in critical layers at inference to realign saliency with task-relevant regions. Experiments across datasets report improved defense success rates, robustness, and negligible impact on general capabilities.

Significance. If the divergence-based layer selection proves stable, LaSM offers a lightweight, inference-only defense that avoids retraining costs while addressing attention misalignment vulnerabilities in GUI agents. The public code release supports reproducibility.

major comments (2)

[§3] §3 (Layer Identification): The central claim depends on the stability of the attention/MLP divergence signature for selecting critical layers, yet the manuscript provides no cross-model transfer tests, no ablation of the layer-selection heuristic, and no evaluation on held-out attack variants or GUI tasks. If the divergent layers shift under a new pop-up style or different MLLM backbone, the fixed scaling would modulate the wrong modules and the reported defense gains would not hold.
[§4.2] §4.2 (Experimental Controls): The abstract and results claim extensive experiments with improved defense rates, but specific details on statistical measures (e.g., standard deviation across runs), controls for post-hoc layer selection bias, and exact attack injection parameters are not reported, making it difficult to assess whether the robustness is causal or dataset-specific.

minor comments (2)

[§3] Notation for the scaling factor and critical-layer indices should be defined explicitly in the main text rather than only in the appendix or code.
[Figures 3-5] Figure captions for attention visualizations could more clearly indicate which layers are being scaled and how the saliency maps were generated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We have carefully reviewed the concerns regarding the stability of the layer-wise divergence pattern and the need for stronger experimental controls. Below we respond point-by-point and indicate the revisions planned for the next version of the paper.

Referee: [§3] §3 (Layer Identification): The central claim depends on the stability of the attention/MLP divergence signature for selecting critical layers, yet the manuscript provides no cross-model transfer tests, no ablation of the layer-selection heuristic, and no evaluation on held-out attack variants or GUI tasks. If the divergent layers shift under a new pop-up style or different MLLM backbone, the fixed scaling would modulate the wrong modules and the reported defense gains would not hold.

Authors: We appreciate the referee’s emphasis on validating the stability of the divergence-based layer selection. In the current experiments the divergence signature was observed consistently across the GUI tasks and attack instances used for evaluation. In the revised manuscript we will add an ablation that varies the divergence threshold used for layer selection and reports the resulting defense success rates, thereby testing the sensitivity of the heuristic. We will also evaluate performance on held-out GUI tasks and introduce new pop-up attack variants (different sizes, positions, and visual styles) that were not used during layer identification. Cross-model transfer experiments on additional MLLM backbones are acknowledged as valuable but lie outside the scope of the present work due to substantial computational cost; we will explicitly note this limitation and flag it for future study. These additions will clarify the conditions under which the reported gains are expected to hold.
Revision: partial
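The threshold ablation promised in this response could take roughly the following shape. This is a sketch under assumptions: divergence is measured as a per-layer similarity score where lower means more divergent, layers are numbered from 1, and the similarity values and thresholds are invented for illustration. Evaluating defense success rate for each selected set is left out.

```python
def select_layers(similarities, threshold):
    """1-indexed layers whose similarity between correct and incorrect
    runs falls below `threshold` (lower similarity = more divergent)."""
    return [i + 1 for i, s in enumerate(similarities) if s < threshold]


def threshold_sweep(similarities, thresholds):
    """Map each candidate threshold to the layers it would select."""
    return {t: select_layers(similarities, t) for t in thresholds}


# Hypothetical per-layer similarities for an 8-layer model.
sims = [0.98, 0.95, 0.60, 0.42, 0.38, 0.71, 0.93, 0.97]
sweep = threshold_sweep(sims, [0.5, 0.75, 0.9])
```

If the selected set stays the same across a wide band of thresholds, as it does here between 0.75 and 0.9, the heuristic is insensitive to that choice; if it changes sharply, the reported gains depend on tuning.
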
Referee: [§4.2] §4.2 (Experimental Controls): The abstract and results claim extensive experiments with improved defense rates, but specific details on statistical measures (e.g., standard deviation across runs), controls for post-hoc layer selection bias, and exact attack injection parameters are not reported, making it difficult to assess whether the robustness is causal or dataset-specific.

Authors: We agree that additional statistical reporting and controls are necessary. In the revised version we will report standard deviations computed over at least five independent runs with different random seeds for all main metrics. To address possible post-hoc bias, we will include a control experiment that applies the same scaling magnitude to randomly chosen layers instead of the divergence-selected layers and show that performance gains are substantially lower, supporting the specificity of our selection. We will also expand the experimental setup to list the precise attack injection parameters (pop-up size range, screen coordinates, transparency, and textual content) used in each dataset. These changes will strengthen the claim that the observed robustness arises from the proposed layer-wise mechanism rather than from dataset idiosyncrasies.
Revision: yes
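The two controls described in this response, a random-layer baseline and per-seed statistics, can be sketched with the standard library alone. The 28-layer count comes from the paper passage quoted on this page; the number of layers drawn, the function names, and every DSR value below are placeholders, not reported results.

```python
import random
import statistics


def random_layer_control(num_layers, num_selected, seed):
    """Pick `num_selected` distinct 1-indexed layers uniformly at random."""
    rng = random.Random(seed)  # seeded so each control run is reproducible
    return sorted(rng.sample(range(1, num_layers + 1), num_selected))


def summarize(runs):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(runs), statistics.stdev(runs)


# Placeholder DSR values over five seeds (not from the paper).
selected_dsr = [0.80, 0.82, 0.78, 0.81, 0.79]
random_dsr = [0.45, 0.52, 0.40, 0.48, 0.50]

control_layers = [random_layer_control(28, 4, seed) for seed in range(5)]
sel_mean, sel_std = summarize(selected_dsr)
rnd_mean, rnd_std = summarize(random_dsr)
```

The comparison of interest is whether the gap between the two means is large relative to both standard deviations.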

Circularity Check

0 steps flagged

No circularity: LaSM derives from empirical attention pattern observation validated by experiments

The paper identifies a layer-wise attention divergence pattern through systematic study of attack effects on GUI agents, then proposes LaSM to scale modules in selected critical layers at inference. This construction relies on direct observation and experimental validation across datasets rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. The defense success rate is measured externally and does not reduce to the input observations by construction. The derivation chain remains self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on empirical observation of attention behavior in existing MLLM architectures rather than new free parameters, axioms beyond standard transformer assumptions, or invented entities.

axioms (1)

Domain assumption: Attention and MLP modules in transformer-based MLLMs can be selectively scaled in specific layers to improve saliency alignment.
Invoked when proposing LaSM as a modulation technique without retraining.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We therefore adopt a more refined strategy: Layer-wise scaling mechanism, which performs selective scaling on attention and MLP weights with specific layers. ... starts with scaling all layers (Layers 1 to 28) and measuring the proportion of outputs predicted as <icon-cross>.
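The quoted passage says only that the search starts from all 28 layers and measures the proportion of outputs choosing the close icon. One way such a range could be narrowed is sketched below. The greedy shrink rule, the `evaluate` callback, and the toy scoring function are all assumptions made for illustration; they are not the procedure from the paper.

```python
def narrow_layer_range(num_layers, evaluate, tolerance=0.0, min_width=1):
    """Greedily shrink a 1-indexed inclusive layer range [first, last].

    `evaluate(first, last)` is a hypothetical callback returning a score,
    e.g. the proportion of outputs that choose the close icon when that
    range is scaled. A shrink is kept unless the score drops more than
    `tolerance` below the best score seen so far.
    """
    first, last = 1, num_layers
    best = evaluate(first, last)
    while last - first + 1 > min_width:
        # Try dropping the first layer or the last layer of the range.
        candidates = [(first + 1, last), (first, last - 1)]
        score, f, l = max((evaluate(f, l), f, l) for f, l in candidates)
        if score < best - tolerance:
            break
        first, last = f, l
        best = max(best, score)
    return first, last, best


def toy_evaluate(first, last):
    """Toy score: layers 10 to 14 help, every other scaled layer costs a little."""
    helpful = set(range(10, 15))
    in_range = set(range(first, last + 1))
    return len(helpful & in_range) / 5 - 0.01 * len(in_range - helpful)


first, last, best = narrow_layer_range(28, toy_evaluate)
```

On the toy scorer the search settles on layers 10 to 14, since dropping any further layer would lower the score.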

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem.

the angular difference increases notably in the selected scaling layers. This suggests that these layers capture stronger differences in decision behavior
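The angular difference mentioned in this passage can be computed per layer as the angle between two hidden-state vectors. The sketch below is an illustration under assumptions: the page does not say which two sets of hidden states are compared, so the inputs here are simply two hypothetical runs with one vector per layer.

```python
import math


def angular_difference(u, v):
    """Angle in degrees between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("a zero vector has no direction")
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def layerwise_angles(states_a, states_b):
    """Per-layer angular difference between two runs' hidden states."""
    return [angular_difference(a, b) for a, b in zip(states_a, states_b)]


# Toy hidden states (hypothetical): 3 layers, 2 dimensions each.
run_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
run_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
angles = layerwise_angles(run_a, run_b)
```

A rising angle across layers, as in the toy data, is the pattern the passage associates with the selected scaling layers.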

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
cs.CL · 2025-09 · unverdicted · novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...

[1] [1]

Large Language Model-Brained GUI Agents: A Survey

C. Zhang, S. He, J. Qian, B. Li, L. Li, S. Qin, Y . Kang, M. Ma, G. Liu, Q. Lin, et al., “Large language model-brained gui agents: A survey,”arXiv preprint arXiv:2411.18279, 2024

work page internal anchor Pith review arXiv 2024

[2] [2]

Gui agents: A survey,

D. Nguyen, J. Chen, Y . Wang, G. Wu, N. Park, Z. Hu, H. Lyu, J. Wu, R. Aponte, Y . Xia,et al., “Gui agents: A survey,”arXiv preprint arXiv:2412.13501, 2024

work page arXiv 2024

[3] [3]

Llm-powered gui agents in phone automation: Surveying progress and prospects,

G. Liu, P. Zhao, L. Liu, Y . Guo, H. Xiao, W. Lin, Y . Chai, Y . Han, S. Ren, H. Wang,et al., “Llm-powered gui agents in phone automation: Surveying progress and prospects,” arXiv preprint arXiv:2504.19838, 2025

work page arXiv 2025

[4] [4]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR) , 2023

work page 2023

[5] [5]

Webcanvas: Benchmarking web agents in online environments,

Y . Pan, D. Kong, S. Zhou, C. Cui, Y . Leng, B. Jiang, H. Liu, Y . Shang, S. Zhou, T. Wu, and Z. Wu, “Webcanvas: Benchmarking web agents in online environments,” inAgentic Markets Workshop at ICML 2024, 2024

work page 2024

[6] [6]

SeeClick: Harnessing GUI grounding for advanced visual GUI agents,

K. Cheng, Q. Sun, Y . Chu, F. Xu, L. YanTao, J. Zhang, and Z. Wu, “SeeClick: Harnessing GUI grounding for advanced visual GUI agents,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pp. 9313–9332, Association for Computational Linguistics, Aug. 2024

work page 2024

[7] [7]

EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIV ACY LEAKAGE,

Z. Liao, L. Mo, C. Xu, M. Kang, J. Zhang, C. Xiao, Y . Tian, B. Li, and H. Sun, “EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIV ACY LEAKAGE,” inThe Thirteenth International Conference on Learning Representations , 2025

work page 2025

[8] [8]

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

I. Evtimov, A. Zharmagambetov, A. Grattafiori, C. Guo, and K. Chaudhuri, “Wasp: Benchmarking web agent security against prompt injection attacks,” arXiv preprint arXiv:2504.18575, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking,

M. Kuo, J. Zhang, A. Ding, Q. Wang, L. DiValentin, Y . Bao, W. Wei, H. Li, and Y . Chen, “H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking,”arXiv preprint arXiv:2502.12893, 2025

work page arXiv 2025

[10] [10]

Caution for the environment: Multimodal agents are susceptible to environmental distractions,

X. Ma, Y . Wang, Y . Yao, T. Yuan, A. Zhang, Z. Zhang, and H. Zhao, “Caution for the environment: Multimodal agents are susceptible to environmental distractions,”arXiv preprint arXiv:2408.02544, 2024

work page arXiv 2024

[11] [11]

Attacking vision-language computer agents via pop-ups,

Y . Zhang, T. Yu, and D. Yang, “Attacking vision-language computer agents via pop-ups,”arXiv preprint arXiv:2411.02391, 2024

work page arXiv 2024

[12] [12]

Secalign: Defending against prompt injection with preference optimization,

S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wagner, and C. Guo, “Secalign: Defending against prompt injection with preference optimization,” arXiv preprint arXiv:2410.05451, 2025

work page arXiv 2025

[13] [13]

Jatmo: Prompt injection defense by task-specific finetuning,

J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “Jatmo: Prompt injection defense by task-specific finetuning,” inEuropean Symposium on Research in Computer Security , pp. 105–124, Springer, 2024

work page 2024

[14] [14]

Direct preference opti- mization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference opti- mization: Your language model is secretly a reward model,”Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023

work page 2023

[15] [15]

In-context defense in computer agents: An empirical study,

P. Yang, H. Ci, and M. Z. Shou, “In-context defense in computer agents: An empirical study,” arXiv preprint arXiv:2503.09241, 2025

work page arXiv 2025

[16] [16]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge,et al., “Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024

[17]

A survey on (m)llm-based gui agents,

F. Tang, H. Xu, H. Zhang, S. Chen, X. Wu, Y. Shen, W. Zhang, G. Hou, Z. Tan, Y. Yan, et al., “A survey on (m)llm-based gui agents,” arXiv preprint arXiv:2504.13865, 2025

[18]

WebArena: A Realistic Web Environment for Building Autonomous Agents

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al., “Webarena: A realistic web environment for building autonomous agents,” arXiv preprint arXiv:2307.13854, 2023

[19]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al., “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” Advances in Neural Information Processing Systems, vol. 37, pp. 52040–52094, 2024

[20]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Z. Wu, Z. Wu, F. Xu, Y. Wang, Q. Sun, C. Jia, K. Cheng, Z. Ding, L. Chen, P. P. Liang, et al., “Os-atlas: A foundation action model for generalist gui agents,” arXiv preprint arXiv:2410.23218, 2024

[21]

Screenspot-pro: Gui grounding for professional high-resolution computer use,

K. Li, Z. Meng, H. Lin, Z. Luo, Y. Tian, J. Ma, Z. Huang, and T.-S. Chua, “Screenspot-pro: Gui grounding for professional high-resolution computer use,” arXiv preprint arXiv:2504.07981, 2025

[22]

You only look at screens: Multimodal chain-of-action agents,

Z. Zhang and A. Zhang, “You only look at screens: Multimodal chain-of-action agents,” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 3132–3149, Association for Computational Linguistics, Aug. 2024.

[23]

CoCo-agent: A comprehensive cognitive MLLM agent for smartphone GUI automation,

X. Ma, Z. Zhang, and H. Zhao, “CoCo-agent: A comprehensive cognitive MLLM agent for smartphone GUI automation,” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 9097–9110, Association for Computational Linguistics, Aug. 2024

[24]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al., “Ui-tars: Pioneering automated gui interaction with native agents,” arXiv preprint arXiv:2501.12326, 2025

[25]

Towards trustworthy gui agents: A survey,

Y. Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,” arXiv preprint arXiv:2503.23434, 2025

[26]

The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections,

C. Chen, Z. Zhang, B. Guo, S. Ma, I. Khalilov, S. A. Gebreegziabher, Y. Ye, Z. Xiao, Y. Yao, T. Li, et al., “The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections,” arXiv preprint arXiv:2504.11281, 2025

[27]

Watch out your album! on the inadvertent privacy memorization in multi-modal large language models,

T. Ju, Y. Hua, H. Fei, Z. Shao, Y. Zheng, H. Zhao, M.-L. Lee, W. Hsu, Z. Zhang, and G. Liu, “Watch out your album! on the inadvertent privacy memorization in multi-modal large language models,” arXiv preprint arXiv:2503.01208, 2025

[28]

Evaluating the robustness of multimodal agents against active environmental injection attacks,

Y. Chen, X. Hu, K. Yin, J. Li, and S. Zhang, “Evaluating the robustness of multimodal agents against active environmental injection attacks,” arXiv preprint arXiv:2502.13053, 2025

[29]

Visual explanations from deep networks via gradient-based localization,

R. Ramprasaath, M. Selvaraju, and A. Das, “Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 618–626, 2019

[30]

Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,

H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 397–406, 2021

[31]

Visual explanations via iterated integrated attributions,

O. Barkan, Y. Asher, A. Eshel, N. Koenigstein, et al., “Visual explanations via iterated integrated attributions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2073–2084, 2023

[32]

Where do large vision-language models look at when answering questions?,

X. Xing, C.-W. Kuo, L. Fuxin, Y. Niu, F. Chen, M. Li, Y. Wu, L. Wen, and S. Zhu, “Where do large vision-language models look at when answering questions?,” arXiv preprint arXiv:2503.13891, 2025

[33]

Mllms know where to look: Training-free perception of small visual details with multimodal llms,

J. Zhang, M. Khayatkhoei, P. Chhikara, and F. Ilievski, “Mllms know where to look: Training-free perception of small visual details with multimodal llms,” arXiv preprint arXiv:2502.17422, 2025

[34]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

[35]

LLaVA-neXT-interleave: Tackling multi-image, video, and 3d in large multimodal models,

F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. MA, and C. Li, “LLaVA-neXT-interleave: Tackling multi-image, video, and 3d in large multimodal models,” in The Thirteenth International Conference on Learning Representations, 2025

[36]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, et al., “The llama 3 herd of models,” CoRR, vol. abs/2407.21783, 2024

[37]

On the effects of data scale on ui control agents,

W. Li, W. E. Bishop, A. Li, C. Rawles, F. Campbell-Ajala, D. Tyamagundlu, and O. Riva, “On the effects of data scale on ui control agents,” Advances in Neural Information Processing Systems, vol. 37, pp. 92130–92154, 2024.

A Implementation Details

A.1 Prompt template

In this work, we do not aim to improve the precision of coordinate prediction, instead...