pith. sign in

arxiv: 2606.00497 · v1 · pith:NMTOIRCZnew · submitted 2026-05-30 · 💻 cs.CR · cs.CL

"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

Pith reviewed 2026-06-28 18:52 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords PII leakageautonomous web agentssocial engineeringscam detectionprivacy risksLLM agentsbenchmarkingdeceptive web content
0
0 comments X

The pith

Autonomous web agents submit critical personal information to scam sites in 54-93% of interactions even when their reasoning flags the sites as suspicious.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure how effectively social-engineering attacks extract PII from frontier web agents by constructing controlled attack environments and matched benign controls. It reports that leakage rates on attack sites far exceed those on benign twins, establishing that the submissions are driven by the attacks rather than routine form handling. The central finding is a detection-action gap: even when an independent judge confirms the agent has voiced suspicion, critical PII is still sent in over a third of sessions. This matters because current agent architectures appear unable to convert recognized risk into withheld submissions, leaving deployed systems exposed without additional external controls.

Core claim

Social-engineering attacks are highly effective at extracting critical-tier PII from frontier web agents. Across 91 attacker-controlled environments and 10 benign-twin baselines that span eight attack vectors, critical PII leakage reaches 54-93% under no privacy guidance versus 0% on the benign baselines. Agents whose reasoning an independent LLM judge confirms has flagged the site as suspicious still submit critical PII in 35.9% of sessions, compared with 66.1% when no suspicion is verbalized, a gap that holds across four model families. Prompt-level mitigations produce sharply model-dependent reductions but remain insufficient at the pooled level, indicating that defenses conditioned on th

What carries the argument

The Scammer4U benchmark, built on an 8-axis factorial taxonomy that isolates the causal contribution of individual attack design factors, paired with an independent LLM judge that identifies when an agent's reasoning has flagged a site as suspicious.

If this is right

  • Leakage is attack-attributable rather than incidental form-filling.
  • Prompt-level privacy guidance produces model-dependent reductions but does not reliably prevent critical PII submission at the pooled level.
  • Agents that verbally flag suspicion still submit PII at a 30.2% lower rate than those that do not, yet the absolute rate remains 35.9%.
  • Output-level interception of outbound submissions is required because defenses that rely on the agent's own recognition are gating on the wrong signal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed agents performing routine web tasks could expose users to identity theft or financial harm unless external submission filters are added.
  • The same detection-action gap may appear in other agent domains such as email or mobile-app interactions where deceptive prompts can be presented.
  • Testing whether human oversight of outbound actions closes the gap more effectively than internal prompting would directly test the paper's call for output-level defenses.

Load-bearing premise

The 8-axis taxonomy successfully isolates each attack factor's contribution and the LLM judge reliably detects when an agent has recognized the site as suspicious.

What would settle it

A controlled study in which the same agents interact with live scam sites while human raters independently score both the agent's verbalized suspicion and the actual PII submitted, checking whether the reported 30.2% gap persists.

Figures

Figures reproduced from arXiv: 2606.00497 by Arya Bharaty, Dhruv Kumar, Murari Mandal, Sarthakbrata Halder, Soham Roy, Srikant Panda, Vaibhav Bhaskar, Yash Sinha.

Figure 1
Figure 1. Figure 1: SCAMMER4U threat model: a VLM web agent carrying the user’s PII browses an attacker￾controlled site that social-engineers it into submitting critical fields to an attacker-controlled endpoint (§3.1). Scout, in the same step it submitted the user’s SSN, card number, and CVV to the attacker (§4.3). AI agents that can browse the web on a user’s behalf are rapidly moving from research into real-world deploymen… view at source ↗
Figure 2
Figure 2. Figure 2: F1 detection–action gap (LLM-judge DR, reached_trap = True, pooled-of-4 models). (a) Critical-tier leak rate by condition for judge-confirmed detectors (DR=1) vs non-detectors (DR=0); C3 is the pre-registered primary test (30.2 pp gap, nDR=1 = 462, q < 0.001). (b) Per-model gap at C3 with Wilson 95% CIs; only the pooled row clears the n ≥ 200 power guard (§3.4). Source [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 3
Figure 3. Figure 3: Fidelity-review stimuli. Left (Column A): Scammer4U environments most often mis-identified as real captured phishing (top-5 by mis-identification rate). Right (Column B): real captured-phishing pages most often mis-identified as Scammer4U. Per-stim accuracies and per-axis Likerts are in [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

Deceptive web content, widely instantiated across the internet and commonly known as \textit{social-engineering attacks}, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attacker-controlled endpoints. In this paper, we show that social-engineering attacks are highly effective at extracting critical-tier PII from frontier web agents, posing a severe risk to deployed agentic systems. To quantify this risk, we introduce \textbf{\textsc{Scammer4U}}, a pre-registered benchmark of 91 attacker-controlled environments and 10 benign-twin baselines, spanning 8 attack vectors and 16 site categories on an 8-axis factorial taxonomy that isolates the causal contribution of individual attack design factors. Across frontier agents, we find that critical-tier PII leakage reaches 54--93\% under no privacy guidance, compared to 0\% on benign-twin baselines, confirming that leakage is attack-attributable rather than incidental form-filling. Escalating prompt-level mitigation yields sharply model-dependent reductions across the four families and remains insufficient to reliably prevent critical PII submission at the pooled level. Most critically, we identify a detection--action gap: agents whose reasoning an independent LLM judge confirms has flagged the site as suspicious still submit critical PII in 35.9\% of sessions, versus 66.1\% when no suspicion is verbalized, a 30.2\% gap robust across all four model families. Our findings reveal that defenses conditioned on the agent's own recognition of an attack are gating on the wrong signal, motivating output-level interception of outbound submissions that operates independently of the agent's reasoning loop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the Scammer4U benchmark of 91 attacker-controlled environments and 10 benign-twin baselines spanning 8 attack vectors on an 8-axis factorial taxonomy. It reports critical-tier PII leakage rates of 54-93% for frontier web agents with no privacy guidance (0% on benign baselines) and identifies a detection-action gap in which agents submit critical PII in 35.9% of sessions even when an independent LLM judge confirms suspicion has been flagged in reasoning, versus 66.1% when no suspicion is verbalized (30.2 pp gap, robust across four model families).

Significance. If the central empirical claims hold after validation, the work is significant for AI agent security: it provides a pre-registered, controlled benchmark with benign twins that isolates attack factors and demonstrates that prompt-level mitigations are insufficient while output-level interception may be needed. The factorial design and explicit comparison to benign baselines are strengths that allow attribution of leakage to attacks rather than incidental behavior.

major comments (1)
  1. [Abstract (detection-action gap paragraph) and associated methods] The detection-action gap (35.9% vs 66.1% critical PII leakage) is the paper's most load-bearing finding and rests entirely on the independent LLM judge correctly classifying when agent reasoning has flagged the site as suspicious. No judge model, prompt, human validation set, inter-annotator agreement, or error analysis is referenced in the abstract or described methods, so systematic mislabeling could erase or reverse the 30.2 pp gap.
minor comments (1)
  1. [Abstract] The abstract states numerical results but does not specify the exact PII classification rules or the four model families used; adding these details would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the detection-action gap analysis. We agree that additional methodological transparency is required and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract (detection-action gap paragraph) and associated methods] The detection-action gap (35.9% vs 66.1% critical PII leakage) is the paper's most load-bearing finding and rests entirely on the independent LLM judge correctly classifying when agent reasoning has flagged the site as suspicious. No judge model, prompt, human validation set, inter-annotator agreement, or error analysis is referenced in the abstract or described methods, so systematic mislabeling could erase or reverse the 30.2 pp gap.

    Authors: We agree that the current manuscript does not sufficiently describe the independent LLM judge in the methods section or reference it in the abstract. In the revision we will add: the specific judge model and version, the full classification prompt, details of any human validation set (size, sampling, and agreement metrics such as inter-annotator agreement), and an error analysis of the judge's classifications. These additions will be placed in the methods and explicitly cross-referenced from the abstract and results. We believe the reported 30.2 pp gap remains informative given its consistency across four model families, but we accept that full transparency on the judge is necessary for readers to evaluate potential mislabeling. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study

full rationale

The paper introduces an empirical benchmark (Scammer4U) consisting of 91 attacker-controlled environments and reports direct experimental measurements of PII leakage rates across models and conditions. No equations, fitted parameters, derivations, or predictions are present. All central claims (e.g., 54-93% leakage, 35.9% vs 66.1% detection-action gap) are computed from the new benchmark runs rather than reduced to prior quantities, self-citations, or ansatzes. The LLM judge is used as an analysis tool on the collected reasoning traces but is not derived from or equivalent to the leakage statistics themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central empirical claims rest on the validity of the benchmark environments, the accuracy of critical-tier PII labeling, and the reliability of the independent LLM judge for detecting verbalized suspicion; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 8-axis factorial taxonomy isolates the causal contribution of individual attack design factors
    Stated directly in the abstract as the design property of the benchmark.

pith-pipeline@v0.9.1-grok · 5861 in / 1460 out tokens · 40775 ms · 2026-06-28T18:52:00.974443+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

151 extracted references · 30 canonical work pages · 5 internal anchors

  1. [1]

    Microsoft Corporation, et al

    The New York Times Company v. Microsoft Corporation, et al. , court =

  2. [2]

    Jay Inslee, et al

    Ryan Scott Adams v. Jay Inslee, et al. , court =

  3. [3]

    Meta Platforms, Inc

    Kadrey v. Meta Platforms, Inc. , author =

  4. [4]

    arXiv preprint arXiv:2503.17514 , year=

    Language Models May Verbatim Complete Text They Were Not Explicitly Trained On , author=. arXiv preprint arXiv:2503.17514 , year=

  5. [5]

    SHIELD : Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

    Liu, Xiaoze and Sun, Ting and Xu, Tianyang and Wu, Feijie and Wang, Cunxiang and Wang, Xiaoqian and Gao, Jing. SHIELD : Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.98

  6. [6]

    2025 , eprint=

    DeepSeek-V3 Technical Report , author=. 2025 , eprint=

  7. [7]

    arXiv preprint arXiv:2502.00406 , year=

    ALU: Agentic LLM Unlearning , author=. arXiv preprint arXiv:2502.00406 , year=

  8. [8]

    2025 , eprint=

    Qwen2.5 Technical Report , author=. 2025 , eprint=

  9. [9]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=

  10. [10]

    2024 , eprint=

    GPT-4o System Card , author=. 2024 , eprint=

  11. [11]

    2024 , eprint=

    Phi-4 Technical Report , author=. 2024 , eprint=

  12. [12]

    2020 , eprint=

    Scaling Laws for Neural Language Models , author=. 2020 , eprint=

  13. [13]

    International Conference on Machine Learning , year=

    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts , author=. International Conference on Machine Learning , year=

  14. [14]

    International Conference on Machine Learning , year=

    On Provable Copyright Protection for Generative Models , author=. International Conference on Machine Learning , year=

  15. [15]

    AAAI Conference on Artificial Intelligence , year=

    How to Protect Copyright Data in Optimization of Large Language Models? , author=. AAAI Conference on Artificial Intelligence , year=

  16. [16]

    ArXiv , year=

    Copyright Protection in Generative AI: A Technical Perspective , author=. ArXiv , year=

  17. [17]

    2024 , eprint=

    SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore , author=. 2024 , eprint=

  18. [18]

    2024 , eprint=

    SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation , author=. 2024 , eprint=

  19. [19]

    2023 , eprint=

    Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy , author=. 2023 , eprint=

  20. [20]

    Towards Making Systems Forget with Machine Unlearning , year=

    Cao, Yinzhi and Yang, Junfeng , booktitle=. Towards Making Systems Forget with Machine Unlearning , year=

  21. [21]

    International Conference on Learning Representations , year=

    Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs , author=. International Conference on Learning Representations , year=

  22. [22]

    Annual Meeting of the Association for Computational Linguistics , year=

    Unlearning Traces the Influential Training Data of Language Models , author=. Annual Meeting of the Association for Computational Linguistics , year=

  23. [23]

    out-of-distribution data in LLMs under gradient-based method , author=

    Unlearning in- vs. out-of-distribution data in LLMs under gradient-based method , author=. ArXiv , year=

  24. [24]

    ArXiv , year=

    Strong Copyright Protection for Language Models via Adaptive Model Fusion , author=. ArXiv , year=

  25. [25]

    ArXiv , year=

    Evaluating Copyright Takedown Methods for Language Models , author=. ArXiv , year=

  26. [26]

    Copyright Law of the United States

    United States Copyright Office. Copyright Law of the United States. 2024 , url =

  27. [27]

    2024 , eprint=

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. 2024 , eprint=

  28. [28]

    ROUGE : A Package for Automatic Evaluation of Summaries

    Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

  29. [29]

    and Hakonen, H

    Bergroth, L. and Hakonen, H. and Raita, T. , booktitle=. A survey of longest common subsequence algorithms , year=

  30. [30]

    2024 , eprint=

    Copyright Traps for Large Language Models , author=. 2024 , eprint=

  31. [31]

    2024 , eprint=

    Evaluating Copyright Takedown Methods for Language Models , author=. 2024 , eprint=

  32. [32]

    2023 , eprint=

    How to Protect Copyright Data in Optimization of Large Language Models? , author=. 2023 , eprint=

  33. [33]

    2025 , url=

    Divyansh Garg and Diego Caples and Andis Draguns and Nikil Ravi and Pranav Putta and Naman Garg and Prannay Hebbar and Youngchul Joo and Jindong Gu and Charles London and Christian Schroeder de Witt and Sumeet Ramesh Motwani , booktitle=. 2025 , url=

  34. [34]

    Berkay Celik , title=

    Devin Ersoy and Brandon Lee and Ananth Shreekumar and Arjun Arunasalam and Muhammad Ibrahim and Antonio Bianchi and Z. Berkay Celik , title=. CoRR , volume=. 2025 , month=

  35. [35]

    ArXiv , year=

    PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark , author=. ArXiv , year=

  36. [36]

    2026 , eprint=

    AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems , author=. 2026 , eprint=

  37. [37]

    2026 , eprint=

    AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks , author=. 2026 , eprint=

  38. [38]

    AgentDojo: a dynamic environment to evaluate prompt injection attacks and defenses for LLM agents , year =

    Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram\`. AgentDojo: a dynamic environment to evaluate prompt injection attacks and defenses for LLM agents , year =. Proceedings of the 38th International Conference on Neural Information Processing Systems , articleno =

  39. [39]

    2025 , eprint=

    Machine-Readable Ads: Accessibility and Trust Patterns for AI Web Agents interacting with Online Advertisements , author=. 2025 , eprint=

  40. [40]

    Arman Zharmagambetov and Chuan Guo and Ivan Evtimov and Maya Pavlova and Ruslan Salakhutdinov and Kamalika Chaudhuri , booktitle=. Agent. 2026 , url=

  41. [42]

    The Fourteenth International Conference on Learning Representations , year=

    How Dark Patterns Manipulate Web Agents , author=. The Fourteenth International Conference on Learning Representations , year=

  42. [44]

    Karolina Korgul and Yushi Yang and Arkadiusz Drohomirecki and Piotr Blaszczyk and Will Howard and Lukas Aichberger and Chris Russell and Philip Torr and Adam Mahdi and Adel Bibi , year=. It's a

  43. [45]

    Introducing Operator , year =

  44. [47]

    2024 , eprint=

    Clio: Privacy-Preserving Insights into Real-World AI Use , author=. 2024 , eprint=

  45. [48]

    2025 , eprint=

    Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights , author=. 2025 , eprint=

  46. [49]

    2024 , month =

    Anthropic , title =. 2024 , month =

  47. [50]

    2025 , eprint=

    A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval , author=. 2025 , eprint=

  48. [55]

    Proceedings on Privacy Enhancing Technologies , year=

    Tales from the Dark Side: Privacy Dark Strategies and Privacy Dark Patterns , author=. Proceedings on Privacy Enhancing Technologies , year=

  49. [56]

    ArXiv , year=

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? , author=. ArXiv , year=

  50. [57]

    2024 , eprint=

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. 2024 , eprint=

  51. [58]

    2024 , url=

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks , author=. 2024 , url=

  52. [59]

    Proceedings of the 41st International Conference on Machine Learning , articleno =

    Zheng, Boyuan and Gou, Boyu and Kil, Jihyung and Sun, Huan and Su, Yu , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  53. [60]

    Mind2Web: Towards a Generalist Agent for the Web , url =

    Deng, Xiang and Gu, Yu and Zheng, Boyuan and Chen, Shijie and Stevens, Sam and Wang, Boshi and Sun, Huan and Su, Yu , booktitle =. Mind2Web: Towards a Generalist Agent for the Web , url =

  54. [62]

    ACM SIGOPS Operating Systems Review , volume =

    Hardy, Norm , title =. ACM SIGOPS Operating Systems Review , volume =. 1988 , month =. doi:10.1145/54289.871709 , url =

  55. [63]

    ACM SIGOPS Oper

    The Confused Deputy: (or why capabilities might have been invented) , author=. ACM SIGOPS Oper. Syst. Rev. , year=

  56. [64]

    ArXiv , year=

    From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows , author=. ArXiv , year=

  57. [65]

    ArXiv , year=

    Multi-Agent Systems Execute Arbitrary Malicious Code , author=. ArXiv , year=

  58. [66]

    ArXiv , year=

    Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents , author=. ArXiv , year=

  59. [67]

    ArXiv , year=

    AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection , author=. ArXiv , year=

  60. [68]

    ArXiv , year=

    MCPGuard : Automatically Detecting Vulnerabilities in MCP Servers , author=. ArXiv , year=

  61. [69]

    ArXiv , year=

    Silent Egress: When Implicit Prompt Injection Makes LLM Agents Leak Without a Trace , author=. ArXiv , year=

  62. [70]

    ArXiv , year=

    CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization , author=. ArXiv , year=

  63. [71]

    ArXiv , year=

    LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge , author=. ArXiv , year=

  64. [72]

    ArXiv , year=

    RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection , author=. ArXiv , year=

  65. [73]

    ArXiv , year=

    Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework , author=. ArXiv , year=

  66. [74]

    ArXiv , year=

    Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents , author=. ArXiv , year=

  67. [75]

    ArXiv , year=

    Automating Agent Hijacking via Structural Template Injection , author=. ArXiv , year=

  68. [76]

    ArXiv , year=

    Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks , author=. ArXiv , year=

  69. [77]

    Your AI, My Shell

    "Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors , author=. ArXiv , year=

  70. [78]

    ArXiv , year=

    AgenTRIM: Tool Risk Mitigation for Agentic AI , author=. ArXiv , year=

  71. [79]

    ArXiv , year=

    Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents , author=. ArXiv , year=

  72. [80]

    Annual Meeting of the Association for Computational Linguistics , year=

    The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents , author=. Annual Meeting of the Association for Computational Linguistics , year=

  73. [81]

    North American Chapter of the Association for Computational Linguistics , year=

    Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents , author=. North American Chapter of the Association for Computational Linguistics , year=

  74. [82]

    ArXiv , year=

    Prompt Injection Attack to Tool Selection in LLM Agents , author=. ArXiv , year=

  75. [83]

    International Conference on Machine Learning , year=

    MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents , author=. International Conference on Machine Learning , year=

  76. [84]

    Conference on Empirical Methods in Natural Language Processing , year=

    IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents , author=. Conference on Empirical Methods in Natural Language Processing , year=

  77. [85]

    IEEE Transactions on Software Engineering , year=

    Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol (MCP) Ecosystem , author=. IEEE Transactions on Software Engineering , year=

  78. [86]

    Conference on Empirical Methods in Natural Language Processing , year=

    WebInject: Prompt Injection Attack to Web Agents , author=. Conference on Empirical Methods in Natural Language Processing , year=

  79. [87]

    ArXiv , year=

    Gandalf the Red: Adaptive Security for LLMs , author=. ArXiv , year=

  80. [88]

    ArXiv , year=

    A Systematization of Security Vulnerabilities in Computer Use Agents , author=. ArXiv , year=

Showing first 80 references.