LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
Pith reviewed 2026-05-19 04:05 UTC · model grok-4.3
The pith
A layer-wise scaling mechanism defends GUI agents from pop-up attacks by restoring attention alignment without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GUI agents show a consistent layer-wise attention divergence pattern when pop-up attacks produce incorrect outputs, with saliency shifting away from task areas in identifiable layers. Selectively amplifying the attention and MLP modules in those critical layers realigns model saliency with relevant regions, raising defense success rates while preserving general capabilities.
What carries the argument
LaSM, the Layer-wise Scaling Mechanism, which detects layers with attention divergence and amplifies their attention and MLP components to restore task-relevant focus.
If this is right
- Defense success rate rises substantially on multiple datasets under pop-up attacks.
- The approach remains robust even when attacks include inductive interference.
- General capabilities on standard tasks show negligible change.
- Attention misalignment is identified as a central vulnerability that selective modulation can address.
Where Pith is reading between the lines
- Similar layer-wise adjustments might counter other visual manipulations that divert attention in multimodal agents.
- Agent deployments could adopt this scaling step as a standard safety filter at inference time.
- The divergence pattern could be monitored in real time to trigger scaling only when attacks are suspected.
Load-bearing premise
The layer-wise attention divergence pattern between correct and incorrect outputs stays stable and identifiable across models, tasks, and attack instances so that critical layers can be chosen reliably for scaling.
What would settle it
Running the method on a new set of GUI agents or pop-up attack variants yields no consistent divergence pattern that allows layer selection, or the scaling step produces no measurable rise in defense success rate.
Figures
read the original abstract
Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose \textbf{LaSM}, a \textit{Layer-wise Scaling Mechanism} that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively addressed through selective layer-wise modulation. Our code can be found in https://github.com/YANGTUOMAO/LaSM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LaSM, a training-free Layer-wise Scaling Mechanism for defending GUI agents (built on MLLMs) against pop-up environmental injection attacks. It identifies a layer-wise attention divergence pattern between correct and attacked outputs, then selectively scales attention and MLP modules in critical layers at inference to realign saliency with task-relevant regions. Experiments across datasets report improved defense success rates, robustness, and negligible impact on general capabilities.
Significance. If the divergence-based layer selection proves stable, LaSM offers a lightweight, inference-only defense that avoids retraining costs while addressing attention misalignment vulnerabilities in GUI agents. The public code release supports reproducibility.
major comments (2)
- [§3] §3 (Layer Identification): The central claim depends on the stability of the attention/MLP divergence signature for selecting critical layers, yet the manuscript provides no cross-model transfer tests, no ablation of the layer-selection heuristic, and no evaluation on held-out attack variants or GUI tasks. If the divergent layers shift under a new pop-up style or different MLLM backbone, the fixed scaling would modulate the wrong modules and the reported defense gains would not hold.
- [§4.2] §4.2 (Experimental Controls): The abstract and results claim extensive experiments with improved defense rates, but specific details on statistical measures (e.g., standard deviation across runs), controls for post-hoc layer selection bias, and exact attack injection parameters are not reported, making it difficult to assess whether the robustness is causal or dataset-specific.
minor comments (2)
- [§3] Notation for the scaling factor and critical-layer indices should be defined explicitly in the main text rather than only in the appendix or code.
- [Figures 3-5] Figure captions for attention visualizations could more clearly indicate which layers are being scaled and how the saliency maps were generated.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We have carefully reviewed the concerns regarding the stability of the layer-wise divergence pattern and the need for stronger experimental controls. Below we respond point-by-point and indicate the revisions planned for the next version of the paper.
read point-by-point responses
-
Referee: [§3] §3 (Layer Identification): The central claim depends on the stability of the attention/MLP divergence signature for selecting critical layers, yet the manuscript provides no cross-model transfer tests, no ablation of the layer-selection heuristic, and no evaluation on held-out attack variants or GUI tasks. If the divergent layers shift under a new pop-up style or different MLLM backbone, the fixed scaling would modulate the wrong modules and the reported defense gains would not hold.
Authors: We appreciate the referee’s emphasis on validating the stability of the divergence-based layer selection. In the current experiments the divergence signature was observed consistently across the GUI tasks and attack instances used for evaluation. In the revised manuscript we will add an ablation that varies the divergence threshold used for layer selection and reports the resulting defense success rates, thereby testing the sensitivity of the heuristic. We will also evaluate performance on held-out GUI tasks and introduce new pop-up attack variants (different sizes, positions, and visual styles) that were not used during layer identification. Cross-model transfer experiments on additional MLLM backbones are acknowledged as valuable but lie outside the scope of the present work due to substantial computational cost; we will explicitly note this limitation and flag it for future study. These additions will clarify the conditions under which the reported gains are expected to hold. revision: partial
-
Referee: [§4.2] §4.2 (Experimental Controls): The abstract and results claim extensive experiments with improved defense rates, but specific details on statistical measures (e.g., standard deviation across runs), controls for post-hoc layer selection bias, and exact attack injection parameters are not reported, making it difficult to assess whether the robustness is causal or dataset-specific.
Authors: We agree that additional statistical reporting and controls are necessary. In the revised version we will report standard deviations computed over at least five independent runs with different random seeds for all main metrics. To address possible post-hoc bias, we will include a control experiment that applies the same scaling magnitude to randomly chosen layers instead of the divergence-selected layers and show that performance gains are substantially lower, supporting the specificity of our selection. We will also expand the experimental setup to list the precise attack injection parameters (pop-up size range, screen coordinates, transparency, and textual content) used in each dataset. These changes will strengthen the claim that the observed robustness arises from the proposed layer-wise mechanism rather than from dataset idiosyncrasies. revision: yes
Circularity Check
No circularity: LaSM derives from empirical attention pattern observation validated by experiments
full rationale
The paper identifies a layer-wise attention divergence pattern through systematic study of attack effects on GUI agents, then proposes LaSM to scale modules in selected critical layers at inference. This construction relies on direct observation and experimental validation across datasets rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations. The defense success rate is measured externally and does not reduce to the input observations by construction. The derivation chain remains self-contained with independent empirical content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Attention and MLP modules in transformer-based MLLMs can be selectively scaled in specific layers to improve saliency alignment.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We therefore adopt a more refined strategy: Layer-wise scaling mechanism, which performs selective scaling on attention and MLP weights with specific layers. ... starts with scaling all layers (Layers 1 to 28) and measuring the proportion of outputs predicted as <icon-cross>.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the angular difference increases notably in the selected scaling layers. This suggests that these layers capture stronger differences in decision behavior
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
Reference graph
Works this paper leans on
-
[1]
Large Language Model-Brained GUI Agents: A Survey
C. Zhang, S. He, J. Qian, B. Li, L. Li, S. Qin, Y . Kang, M. Ma, G. Liu, Q. Lin, et al., “Large language model-brained gui agents: A survey,”arXiv preprint arXiv:2411.18279, 2024
work page internal anchor Pith review arXiv 2024
-
[2]
D. Nguyen, J. Chen, Y . Wang, G. Wu, N. Park, Z. Hu, H. Lyu, J. Wu, R. Aponte, Y . Xia,et al., “Gui agents: A survey,”arXiv preprint arXiv:2412.13501, 2024
-
[3]
Llm-powered gui agents in phone automation: Surveying progress and prospects,
G. Liu, P. Zhao, L. Liu, Y . Guo, H. Xiao, W. Lin, Y . Chai, Y . Han, S. Ren, H. Wang,et al., “Llm-powered gui agents in phone automation: Surveying progress and prospects,” arXiv preprint arXiv:2504.19838, 2025
-
[4]
React: Synergizing reasoning and acting in language models,
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR) , 2023
work page 2023
-
[5]
Webcanvas: Benchmarking web agents in online environments,
Y . Pan, D. Kong, S. Zhou, C. Cui, Y . Leng, B. Jiang, H. Liu, Y . Shang, S. Zhou, T. Wu, and Z. Wu, “Webcanvas: Benchmarking web agents in online environments,” inAgentic Markets Workshop at ICML 2024, 2024
work page 2024
-
[6]
SeeClick: Harnessing GUI grounding for advanced visual GUI agents,
K. Cheng, Q. Sun, Y . Chu, F. Xu, L. YanTao, J. Zhang, and Z. Wu, “SeeClick: Harnessing GUI grounding for advanced visual GUI agents,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pp. 9313–9332, Association for Computational Linguistics, Aug. 2024
work page 2024
-
[7]
EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIV ACY LEAKAGE,
Z. Liao, L. Mo, C. Xu, M. Kang, J. Zhang, C. Xiao, Y . Tian, B. Li, and H. Sun, “EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIV ACY LEAKAGE,” inThe Thirteenth International Conference on Learning Representations , 2025
work page 2025
-
[8]
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
I. Evtimov, A. Zharmagambetov, A. Grattafiori, C. Guo, and K. Chaudhuri, “Wasp: Benchmarking web agent security against prompt injection attacks,” arXiv preprint arXiv:2504.18575, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
M. Kuo, J. Zhang, A. Ding, Q. Wang, L. DiValentin, Y . Bao, W. Wei, H. Li, and Y . Chen, “H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking,”arXiv preprint arXiv:2502.12893, 2025
-
[10]
Caution for the environment: Multimodal agents are susceptible to environmental distractions,
X. Ma, Y . Wang, Y . Yao, T. Yuan, A. Zhang, Z. Zhang, and H. Zhao, “Caution for the environment: Multimodal agents are susceptible to environmental distractions,”arXiv preprint arXiv:2408.02544, 2024
-
[11]
Attacking vision-language computer agents via pop-ups,
Y . Zhang, T. Yu, and D. Yang, “Attacking vision-language computer agents via pop-ups,”arXiv preprint arXiv:2411.02391, 2024
-
[12]
Secalign: Defending against prompt injection with preference optimization,
S. Chen, A. Zharmagambetov, S. Mahloujifar, K. Chaudhuri, D. Wagner, and C. Guo, “Secalign: Defending against prompt injection with preference optimization,” arXiv preprint arXiv:2410.05451, 2025
-
[13]
Jatmo: Prompt injection defense by task-specific finetuning,
J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, and D. Wagner, “Jatmo: Prompt injection defense by task-specific finetuning,” inEuropean Symposium on Research in Computer Security , pp. 105–124, Springer, 2024
work page 2024
-
[14]
Direct preference opti- mization: Your language model is secretly a reward model,
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference opti- mization: Your language model is secretly a reward model,”Advances in Neural Information Processing Systems, vol. 36, pp. 53728–53741, 2023
work page 2023
-
[15]
In-context defense in computer agents: An empirical study,
P. Yang, H. Ci, and M. Z. Shou, “In-context defense in computer agents: An empirical study,” arXiv preprint arXiv:2503.09241, 2025
-
[16]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge,et al., “Qwen2-vl: Enhanc- ing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
A survey on (m)llm-based gui agents,
F. Tang, H. Xu, H. Zhang, S. Chen, X. Wu, Y . Shen, W. Zhang, G. Hou, Z. Tan, Y . Yan,et al., “A survey on (m) llm-based gui agents,” arXiv preprint arXiv:2504.13865, 2025
-
[18]
WebArena: A Realistic Web Environment for Building Autonomous Agents
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried,et al., “Webarena: A realistic web environment for building autonomous agents,”arXiv preprint arXiv:2307.13854, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,
T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei,et al., “Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” Advances in Neural Information Processing Systems, vol. 37, pp. 52040–52094, 2024
work page 2024
-
[20]
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Z. Wu, Z. Wu, F. Xu, Y . Wang, Q. Sun, C. Jia, K. Cheng, Z. Ding, L. Chen, P. P. Liang,et al., “Os-atlas: A foundation action model for generalist gui agents,” arXiv preprint arXiv:2410.23218, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
K. Li, Z. Meng, H. Lin, Z. Luo, Y . Tian, J. Ma, Z. Huang, and T.-S. Chua, “Screenspot-pro: Gui grounding for professional high-resolution computer use,” arXiv preprint arXiv:2504.07981, 2025
-
[22]
You only look at screens: Multimodal chain-of-action agents,
Z. Zhang and A. Zhang, “You only look at screens: Multimodal chain-of-action agents,” inFindings of the Association for Computational Linguistics: ACL 2024 , pp. 3132–3149, Association for Computational Linguistics, Aug. 2024. 10
work page 2024
-
[23]
CoCo-agent: A comprehensive cognitive MLLM agent for smartphone GUI automation,
X. Ma, Z. Zhang, and H. Zhao, “CoCo-agent: A comprehensive cognitive MLLM agent for smartphone GUI automation,” inFindings of the Association for Computational Linguistics: ACL 2024 , pp. 9097–9110, Association for Computational Linguistics, Aug. 2024
work page 2024
-
[24]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Y . Qin, Y . Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y . Li, S. Huang, et al. , “Ui-tars: Pioneering automated gui interaction with native agents,”arXiv preprint arXiv:2501.12326, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Towards trustworthy gui agents: A survey,
Y . Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,”arXiv preprint arXiv:2503.23434, 2025
-
[26]
The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections,
C. Chen, Z. Zhang, B. Guo, S. Ma, I. Khalilov, S. A. Gebreegziabher, Y . Ye, Z. Xiao, Y . Yao, T. Li,et al., “The obvious invisible threat: Llm-powered gui agents’ vulnerability to fine-print injections,”arXiv preprint arXiv:2504.11281, 2025
-
[27]
Watch out your album! on the inadvertent privacy memorization in multi-modal large language models,
T. Ju, Y . Hua, H. Fei, Z. Shao, Y . Zheng, H. Zhao, M.-L. Lee, W. Hsu, Z. Zhang, and G. Liu, “Watch out your album! on the inadvertent privacy memorization in multi-modal large language models,”arXiv preprint arXiv:2503.01208, 2025
-
[28]
Evaluating the robustness of multimodal agents against active environmental injection attacks,
Y . Chen, X. Hu, K. Yin, J. Li, and S. Zhang, “Evaluating the robustness of multimodal agents against active environmental injection attacks,”arXiv preprint arXiv:2502.13053, 2025
-
[29]
Visual explanations from deep networks via gradient-based localization,
R. Ramprasaath, M. Selvaraju, and A. Das, “Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV) , pp. 618– 626, 2019
work page 2019
-
[30]
Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,
H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explainability for interpreting bi-modal and encoder-decoder transformers,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 397–406, 2021
work page 2021
-
[31]
Visual explanations via iterated integrated attribu- tions,
O. Barkan, Y . Asher, A. Eshel, N. Koenigstein,et al., “Visual explanations via iterated integrated attribu- tions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , pp. 2073–2084, 2023
work page 2073
-
[32]
Where do large vision-language models look at when answering questions?,
X. Xing, C.-W. Kuo, L. Fuxin, Y . Niu, F. Chen, M. Li, Y . Wu, L. Wen, and S. Zhu, “Where do large vision-language models look at when answering questions?,” arXiv preprint arXiv:2503.13891, 2025
-
[33]
Mllms know where to look: Training-free perception of small visual details with multimodal llms,
J. Zhang, M. Khayatkhoei, P. Chhikara, and F. Ilievski, “Mllms know where to look: Training-free perception of small visual details with multimodal llms,” arXiv preprint arXiv:2502.17422, 2025
-
[34]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems , vol. 30, 2017
work page 2017
-
[35]
LLaV A-neXT-interleave: Tackling multi-image, video, and 3d in large multimodal models,
F. Li, R. Zhang, H. Zhang, Y . Zhang, B. Li, W. Li, Z. MA, and C. Li, “LLaV A-neXT-interleave: Tackling multi-image, video, and 3d in large multimodal models,” inThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[36]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, and others., “The llama 3 herd of models,” CoRR, vol. abs/2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
On the effects of data scale on ui control agents,
W. Li, W. E. Bishop, A. Li, C. Rawles, F. Campbell-Ajala, D. Tyamagundlu, and O. Riva, “On the effects of data scale on ui control agents,”Advances in Neural Information Processing Systems , vol. 37, pp. 92130–92154, 2024. 11 A Implementation Details A.1 Prompt template In this work, we do not aim to improve the precision of coordinate prediction, instead...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.