PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3
The pith
PhySE uses pre-trained visual models and response-driven psychological tactics to remove startup delays in AR-LLM social engineering attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhySE consists of two components: VLM-Based SocialContext Training, which pre-trains a visual language model on social-context data so that detailed target profiles can be generated on the fly without retrieval delays, and an Adaptive Psychological Agent, which classifies target responses and deploys matching classes of psychological strategies in real time, moving beyond static handcrafted scripts. The framework is evaluated on 360 annotated conversations collected from 60 participants in an IRB-approved study spanning multiple social scenarios.
What carries the argument
VLM-Based SocialContext Training for instant on-the-fly profile generation from visual and vocal data, paired with an Adaptive Psychological Agent that selects and applies psychological strategy classes based on live target responses.
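Read mechanically, the contrast is about where the first turn spends its time. A minimal Python sketch of the two profiling paths (all names and stubs are hypothetical; this is not code from the paper):

```python
class StubRetriever:
    """Stand-in for the RAG index a baseline system would query."""
    def search(self, query, k=5):
        return [f"doc{i}" for i in range(k)]

class StubVLM:
    """Stand-in for the pre-trained visual language model."""
    def infer(self, observation):
        return {"interests": ["travel"], "demeanor": "open"}

def profile_via_rag(observation, retriever):
    # Baseline path: the opening turn blocks on a retrieval call,
    # which is the cold-start delay the review attributes to RAG.
    docs = retriever.search(observation["face_embedding"], k=5)
    return {"evidence": docs, "source": "rag"}

def profile_via_pretrained_vlm(observation, vlm):
    # Proposed path: social context was absorbed during pre-training,
    # so the first turn is a single forward pass with no retrieval.
    return {"evidence": vlm.infer(observation), "source": "vlm"}
```

The load-bearing claim is that the second path matches the first in profile quality while removing the retrieval round trip from the opening turns.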
If this is right
- Conversation flow remains uninterrupted because profile generation no longer requires retrieval steps in the opening turns.
- Attack tactics can shift in real time to match observed target behavior instead of following predetermined stages.
- Psychological principles supply a structured basis for choosing which influence tactics to apply at each moment.
- The 360 annotated conversations form a reusable dataset for studying how AR-LLM systems interact with people in live social settings.
Where Pith is reading between the lines
- The same pre-training and adaptation techniques could be tested for non-malicious uses such as real-time assistance in interviews or negotiations.
- Widespread adoption would likely prompt new detection methods focused on identifying AR glasses paired with conversational AI during ordinary encounters.
- Performance may vary across cultural or demographic groups not well represented in the original 60-participant study, suggesting a need for broader validation.
Load-bearing premise
Pre-training a visual language model on social-context data will produce accurate profiles fast enough to eliminate cold-start delays without introducing new latency or errors, and dynamically selecting psychological strategy classes from target replies will build trust more effectively than fixed tactics.
What would settle it
A direct comparison in the collected conversation data showing that first-turn profiles from the pre-trained VLM match or exceed the quality of delayed RAG profiles, or that adaptive strategy selection produces higher measured trust or compliance rates than fixed-stage tactics.
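That comparison reduces to two-sample tests over per-conversation measurements. A sketch using Welch's t statistic with made-up numbers (none of these figures come from the paper):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(va + vb)

# Hypothetical per-conversation measurements, NOT reported in the paper:
latency_rag = [2.1, 1.8, 2.4, 2.0, 1.9]      # first-turn profile latency, seconds
latency_vlm = [0.40, 0.50, 0.45, 0.60, 0.50]

trust_adaptive = [0.70, 0.80, 0.75, 0.65, 0.80]  # per-scenario compliance rate
trust_static = [0.50, 0.55, 0.60, 0.45, 0.50]

# Large positive t values in both comparisons would be the settling evidence.
t_latency = welch_t(latency_rag, latency_vlm)
t_trust = welch_t(trust_adaptive, trust_static)
```

With the real 360-conversation dataset, the same two statistics (plus degrees of freedom and p-values) would directly test both halves of the load-bearing premise.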
Original abstract
The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g., SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target's visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target's trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks: (1) Cold-start personalization: current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction; (2) Static attack strategies: existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations: (1) VLM-Based SocialContext Training: to eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation; (2) Adaptive Psychological Agent: we introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on target responses, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PhySE, a framework for AR-LLM-based social engineering attacks that targets two bottlenecks: cold-start profiling delays from RAG methods and reliance on static handcrafted strategies. It introduces VLM-Based SocialContext Training to enable rapid on-the-fly profile generation via pre-trained visual-language models and an Adaptive Psychological Agent that selects psychological strategy classes dynamically based on target responses. The approach is evaluated in an IRB-approved user study with 60 participants that collected a dataset of 360 annotated conversations across social scenarios.
Significance. If the VLM pre-training demonstrably eliminates first-turn delays without accuracy or latency penalties and the adaptive agent yields measurable gains in trust-building over static baselines, the work could advance research on emerging AR-LLM threats and provide a reusable annotated dataset for studying real-time social interactions. The explicit grounding in psychological theory for the adaptive component is a positive step beyond purely heuristic tactics.
major comments (3)
- [Evaluation section] The manuscript states that an IRB-approved study collected 360 annotated conversations but reports none of the required quantitative metrics (first-turn profile latency vs. RAG, profile accuracy, conversation success rates, or statistical comparisons to static-strategy baselines). This is load-bearing for the central claim that the two innovations resolve the identified bottlenecks.
- [VLM-Based SocialContext Training] No details are given on the social-context pre-training dataset, the exact VLM architecture or fine-tuning procedure, or any ablation showing that on-the-fly inference is faster and at least as accurate as RAG without introducing new latency.
- [Adaptive Psychological Agent] The description of response-conditioned strategy selection lacks the concrete mapping from target utterances to psychological strategy classes, the prompting or fine-tuning method used by the LLM, and any reference to specific psychological literature that justifies the chosen classes.
minor comments (2)
- [Abstract] The abstract would be strengthened by a single sentence summarizing the key empirical outcomes (even if high-level) rather than stopping at dataset collection.
- [Introduction] Notation for the two core components is introduced without a consistent acronym or diagram that distinguishes the VLM pre-training stage from the runtime adaptive agent.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We agree that the evaluation and technical details require expansion to fully support our claims regarding the resolution of the identified bottlenecks. We address each major comment below and will incorporate the requested additions in the revised version.
Point-by-point responses
Referee: [Evaluation section] The manuscript states that an IRB-approved study collected 360 annotated conversations but reports none of the required quantitative metrics (first-turn profile latency vs. RAG, profile accuracy, conversation success rates, or statistical comparisons to static-strategy baselines). This is load-bearing for the central claim that the two innovations resolve the identified bottlenecks.
Authors: The referee is correct that the current manuscript does not report these quantitative metrics. We will revise the Evaluation section to include first-turn profile latency comparisons versus RAG, profile accuracy metrics, conversation success rates across the social scenarios, and statistical comparisons (using appropriate tests such as t-tests) against static-strategy baselines, all computed directly from the 360 annotated conversations collected in the IRB-approved 60-participant study. Revision: yes.
Referee: [VLM-Based SocialContext Training] No details are given on the social-context pre-training dataset, the exact VLM architecture or fine-tuning procedure, or any ablation showing that on-the-fly inference is faster and at least as accurate as RAG without introducing new latency.
Authors: We will expand the VLM-Based SocialContext Training subsection with the requested details: the composition and size of the social-context pre-training dataset, the precise VLM architecture and any modifications, the fine-tuning procedure (including objectives and hyperparameters), and ablation experiments. The ablations will quantify that on-the-fly inference achieves lower first-turn latency than RAG while preserving or improving profile accuracy, without adding pipeline latency. Revision: yes.
Referee: [Adaptive Psychological Agent] The description of response-conditioned strategy selection lacks the concrete mapping from target utterances to psychological strategy classes, the prompting or fine-tuning method used by the LLM, and any reference to specific psychological literature that justifies the chosen classes.
Authors: We will augment the Adaptive Psychological Agent subsection with a concrete mapping (via decision rules or pseudocode) from target utterances to the psychological strategy classes, a description of the LLM prompting or fine-tuning approach used for selection, and citations to specific psychological literature (e.g., Cialdini's principles of persuasion and related social influence research) that ground the chosen classes. Revision: yes.
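One hedged guess at what such a mapping could look like, using Cialdini-style class names as placeholders (the paper's actual strategy classes, keywords, and rules are not specified):

```python
# Hypothetical rule table: coarse signals in the target's reply mapped to
# Cialdini-style strategy classes. Keywords and class names are placeholders.
RULES = [
    (("why", "prove it", "how do you know"), "authority"),    # skepticism: cite credentials
    (("everyone", "others", "people say"), "social_proof"),   # crowd reference: reinforce it
    (("thanks", "helped", "appreciate"), "reciprocity"),      # gratitude: offer a small favor
    (("same here", "me too", "i also"), "liking"),            # similarity: build rapport
]

def select_strategy(utterance: str) -> str:
    """First matching rule wins; default to rapport-building."""
    text = utterance.lower()
    for keywords, strategy in RULES:
        if any(k in text for k in keywords):
            return strategy
    return "liking"
```

For example, `select_strategy("Why should I believe you?")` returns `"authority"`. A real system could replace the keyword rules with an LLM classifier prompted over the same class labels; the table above only illustrates the shape of the promised mapping.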
Circularity Check
No circularity detected; proposals are independent of inputs
Full rationale
The paper identifies cold-start delays and static strategies as bottlenecks, then proposes VLM pre-training for on-the-fly profiles and a response-conditioned psychological agent as solutions. No equations, fitted parameters, or derivations are presented that reduce the claimed innovations back to the inputs by construction. The IRB study is described only as data collection; no predictive claims or self-referential definitions appear. The framework remains a set of forward proposals grounded in stated limitations rather than tautological reductions.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: pre-training a VLM with social-context data enables rapid, on-the-fly profile generation that eliminates cold-start delays in real-time AR interactions.
- Domain assumption: distinct classes of psychological strategies can be dynamically deployed by an LLM agent based on target responses to outperform static handcrafted tactics.
invented entities (2)
- Adaptive Psychological Agent (no independent evidence)
- VLM-Based SocialContext Training procedure (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. E. Abele, N. Ellemers, S. T. Fiske, A. Koch, and V. Yzerbyt. Navigating the social world: Toward an integrated framework for evaluating self, individuals, and groups. Psychological Review, 128(2):290, 2021.
- [2] K. Afane, W. Wei, Y. Mao, J. Farooq, and J. Chen. Next-generation phishing: How LLM agents empower cyber attackers. In 2024 IEEE International Conference on Big Data (BigData), pages 2558–2567. IEEE, 2024.
- [3] T. Bi, C. Ye, Z. Yang, Z. Zhou, C. Tang, Z. Tao, J. Zhang, K. Wang, L. Zhou, Y. Yang, and T. Yu. On the feasibility of using multimodal LLMs to execute AR social engineering attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38252–38260, 2026.
- [4] L. Bilge, T. Strufe, D. Balzarotti, and E. Kirda. All your contacts are belong to us: Automated identity theft attacks on social networks. In Proceedings of the 18th International Conference on World Wide Web, pages 551–560, 2009.
- [5] P. Burda, L. Allodi, and N. Zannone. Cognition in social engineering empirical research: A systematic literature review. ACM Transactions on Computer-Human Interaction, 31(2):1–55, 2024.
- [6] S. Chen, Z. Li, F. Dangelo, C. Gao, and X. Fu. A case study of security and privacy threats from augmented reality (AR). In 2018 International Conference on Computing, Networking and Communications (ICNC), pages 442–446. IEEE, 2018.
- [7] Z. Chen, Z. Zhao, W. Qu, Z. Wen, Z. Han, Z. Zhu, J. Zhang, and H. Yao. Pandora: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024.
- [8] L. Choo. How 2 students used the Meta Ray-Bans to access personal information. https://www.forbes.com/sites/lindseychoo/2024/10/04/meta-ray-bans-ai-privacy-surveillance/, 2024.
- [9]
- [10] E. J. Finkel, P. W. Eastwick, B. R. Karney, H. T. Reis, and S. Sprecher. Online dating: A critical analysis from the perspective of psychological science. Psychological Science in the Public Interest, 13(1):3–66, 2012.
- [11] A. Fuste and C. Schmandt. Artextiles: Promoting social interactions around personal interests through augmented reality. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 470–470, 2017.
- [12] S. Granger. Social engineering fundamentals, part I: Hacker tactics. Security Focus, December 18, 2001.
- [13] E. Harmon-Jones and J. Mills. An introduction to cognitive dissonance theory and an overview of current perspectives on the theory. 2019.
- [14] I. Hirskyj-Douglas, A. Kantosalo, A. Monroy-Hernández, J. Zimmermann, M. Nebeling, and M. Gonzalez-Franco. Social AR: Reimagining and interrogating the role of augmented reality in face-to-face social interactions. In Companion Publication of the 2020 Conference on Computer Supported Cooperative Work and Social Computing, pages 457–465, 2020.
- [15] G. Ho, A. Cidon, L. Gavish, M. Schweighauser, V. Paxson, S. Savage, G. M. Voelker, and D. Wagner. Detecting and characterizing lateral phishing at scale. In 28th USENIX Security Symposium (USENIX Security 19), pages 1273–1290, 2019.
- [16] M. Z. Iqbal and A. G. Campbell. Adopting smart glasses responsibly: Potential benefits, ethical, and privacy concerns with Ray-Ban Stories. AI and Ethics, 3(1):325–327, 2023.
- [17] P. Jansen and F. Fischbach. The social engineer: An immersive virtual reality educational game to raise social engineering awareness. In Extended Abstracts of the 2020 Annual Symposium on Computer-Human Interaction in Play, pages 59–63, 2020.
- [18] H. Kawamichi, K. Yoshihara, A. T. Sasaki, S. K. Sugawara, H. C. Tanabe, R. Shinohara, Y. Sugisawa, K. Tokutake, Y. Mochizuki, T. Anme, et al. Perceiving active listening activates the reward system and improves the impression of relevant experiences. Social Neuroscience, 10(1):16–26, 2015.
- [19] K. Krombholz, H. Hobel, M. Huber, and E. Weippl. Advanced social engineering attacks. Journal of Information Security and Applications, 22:113–122, 2015.
- [20] J.-S. Lee, S. Kim, and S. Pan. The role of relationship marketing investments in customer reciprocity. International Journal of Contemporary Hospitality Management, 26(8):1200–1224, 2014.
- [21] S. M. Lehman, A. S. Alrumayh, K. Kolhe, H. Ling, and C. C. Tan. Hidden in plain sight: Exploring privacy risks of mobile augmented reality applications. ACM Transactions on Privacy and Security, 25(4):1–35, 2022.
- [22]
- [23] A. Ma, J. J. Paek, F. Liu, and J. Y. Kim. Threats to personal control fuel similarity attraction. Proceedings of the National Academy of Sciences, 121(43):e2321189121, 2024.
- [24] S. S. Roy, P. Thota, K. V. Naragam, and S. Nilizadeh. From chatbots to phishbots?: Phishing scam generation in commercial large language models. In 2024 IEEE Symposium on Security and Privacy (SP), pages 36–54. IEEE, 2024.
- [25] D. I. Tamir and J. P. Mitchell. Disclosing information about the self is intrinsically rewarding. Proceedings of the National Academy of Sciences, 109(21):8038–8043, 2012.
- [26] D. Timko, D. H. Castillo, and M. L. Rahman. Understanding influences on SMS phishing detection: User behavior, demographics, and message attributes. 2025.
- [27] H.-R. Tsai, S.-K. Chiu, and B. Wang. GazeNoter: Co-piloted AR note-taking via gaze selection of LLM suggestions to match users' intentions. arXiv preprint arXiv:2407.01161, 2024.
- [28] E. Ulqinaku, H. Assal, A. Abdou, S. Chiasson, and S. Capkun. Is real-time phishing eliminated with FIDO? Social engineering downgrade attacks against FIDO protocols. In 30th USENIX Security Symposium (USENIX Security 21), pages 3811–3828, 2021.
- [29] P. Vadrevu and R. Perdisci. What you see is not what you get: Discovering and tracking social engineering attack campaigns. In Proceedings of the Internet Measurement Conference, pages 308–321, 2019.
- [30] T. A. Venema, F. M. Kroese, J. S. Benjamins, and D. T. De Ridder. When in doubt, follow the crowd? Responsiveness to social proof nudges in the absence of clear preferences. Frontiers in Psychology, 11:1385, 2020.
- [31] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
- [32] I. Wang, J. Smith, and J. Ruiz. Exploring virtual agents for augmented reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2019.
- [33] B. Yang, Y. Guo, L. Xu, Z. Yan, H. Chen, G. Xing, and X. Jiang. SocialMind: LLM-based proactive AR social assistive system with human-like perception for in-situ live interactions. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(1):1–30, 2025.
- [34] Z. Yang, J. Allen, M. Landen, R. Perdisci, and W. Lee. TRIDENT: Towards detecting and mitigating web-based social engineering attacks. In 32nd USENIX Security Symposium (USENIX Security 23), pages 6701–6718, 2023.
- [35] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
- [36] T. Yu, C. Ye, Z. Yang, Z. Zhou, C. Tang, Z. Tao, J. Zhang, K. Wang, L. Zhou, Y. Yang, and T. Bi. SEAR: A multimodal dataset for analyzing AR-LLM-driven social engineering behaviors. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12981–12987, 2025.
- [37] Y. Zhang, C. Slocum, J. Chen, and N. Abu-Ghazaleh. It's all in your head(set): Side-channel attacks on AR/VR systems. In 32nd USENIX Security Symposium (USENIX Security 23), pages 3979–3996, 2023.