arxiv: 2604.23148 · v1 · submitted 2026-04-25 · 💻 cs.AI

Recognition: unknown

PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

Tianlong Yu , Yang Yang , Ziyi Zhou , Jiaying Xu , Siwei Li , Tong Guan , Kailong Wang , Ting Bi

Authors on Pith no claims yet

Pith reviewed 2026-05-08 08:16 UTC · model grok-4.3

classification 💻 cs.AI

keywords AR-LLM social engineeringvisual language modelsadaptive psychological agentsreal-time profilingsocial context trainingaugmented reality attackspsychological strategy selection

0 comments

The pith

PhySE uses pre-trained visual models and response-driven psychological tactics to remove startup delays in AR-LLM social engineering attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix two practical barriers that have kept AR-LLM social engineering attacks from running smoothly in real time. Current retrieval methods create noticeable lags while building a target's profile in the first moments of conversation, and existing tactics rely on fixed, handcrafted scripts instead of established psychological principles. PhySE solves the first problem by pre-training a visual language model on social-context data so profiles form instantly from live visual and vocal input. It solves the second by introducing an agent that selects among distinct psychological strategy classes according to how the target replies. The authors support the approach with data from an IRB-approved study that gathered 360 annotated conversations from 60 participants across varied social settings.

Core claim

PhySE consists of VLM-Based SocialContext Training, which pre-trains a visual language model on social-context data to generate detailed target profiles on the fly without retrieval delays, and an Adaptive Psychological Agent that classifies target responses and deploys matching classes of psychological strategies in real time, moving beyond static handcrafted scripts, as shown in 360 annotated conversations collected from 60 participants in an IRB-approved study spanning multiple social scenarios.

What carries the argument

VLM-Based SocialContext Training for instant on-the-fly profile generation from visual and vocal data, paired with an Adaptive Psychological Agent that selects and applies psychological strategy classes based on live target responses.

If this is right

Conversation flow remains uninterrupted because profile generation no longer requires retrieval steps in the opening turns.
Attack tactics can shift in real time to match observed target behavior instead of following predetermined stages.
Psychological principles supply a structured basis for choosing which influence tactics to apply at each moment.
The 360 annotated conversations form a reusable dataset for studying how AR-LLM systems interact with people in live social settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-training and adaptation techniques could be tested for non-malicious uses such as real-time assistance in interviews or negotiations.
Widespread adoption would likely prompt new detection methods focused on identifying AR glasses paired with conversational AI during ordinary encounters.
Performance may vary across cultural or demographic groups not well represented in the original 60-participant study, suggesting a need for broader validation.

Load-bearing premise

Pre-training a visual language model on social-context data will produce accurate profiles fast enough to eliminate real-time delays without new latency or errors, and dynamically selecting psychological strategy classes from target replies will build trust more effectively than fixed tactics.

What would settle it

A direct comparison in the collected conversation data showing that first-turn profiles from the pre-trained VLM match or exceed the quality of delayed RAG profiles, or that adaptive strategy selection produces higher measured trust or compliance rates than fixed-stage tactics.

Figures

Figures reproduced from arXiv: 2604.23148 by Jiaying Xu, Kailong Wang, Siwei Li, Tianlong Yu, Ting Bi, Tong Guan, Yang Yang, Ziyi Zhou.

**Figure 1.** Figure 1: PhySE’s system architecture and comparison with SEAR. view at source ↗

**Figure 2.** Figure 2: Psychological strategy routing in PhySE: the router view at source ↗

**Figure 3.** Figure 3: Trust model for AR-LLM-SE interactions: how perceived credibility and rapport mediate trust formation and enable view at source ↗

**Figure 4.** Figure 4: Comparison of social-engineering effectiveness. view at source ↗

**Figure 6.** Figure 6: Ablation study via social-experience scores. view at source ↗

**Figure 5.** Figure 5: Comparison of subjective experience scores. view at source ↗

read the original abstract

The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g. SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks, (1) Cold-start personalization, Current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction, (2) Static Attack Strategies, Existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations, (1) VLM-Based SocialContext Training, To eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation, (2) Adaptive Psychological Agent, We introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on target response, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PhySE sketches a VLM-plus-adaptive-agent framework for AR social engineering but reports no metrics from its 60-person study, so the claimed fixes for cold-start delay and static tactics stay unshown.

read the letter

PhySE proposes pre-training a VLM on social-context data so an AR device can generate a target profile in the first turn instead of waiting on RAG retrieval. It also adds an LLM agent that picks from psychological strategy classes according to how the target answers, rather than following fixed handcrafted stages. Those two pieces are the concrete new elements relative to the cited prior work on LLM agents and RAG-based attacks. The paper lays out the two bottlenecks clearly and ties the second fix to established psychological distinctions, which is a step beyond pure engineering descriptions. The evaluation plan is also concrete: an IRB-approved collection of 360 annotated conversations across scenarios. The soft spot is straightforward. The manuscript states that the study was run and the conversations were annotated, yet it supplies none of the required numbers: first-turn latency, profile accuracy against ground truth, conversation success rates, or statistical comparisons to the RAG and static baselines. Without those figures the central claim that the two innovations actually remove the bottlenecks cannot be checked. This paper is for researchers who follow AI security threats on wearable devices and want a structured attack model to think about defenses against. A reader who needs demonstrated performance gains will find the current version thin, but someone mapping the attack surface or designing countermeasures could still extract usable ideas from the framework and the dataset description. The thinking is coherent and the proposal is specific enough that a serious referee could give useful feedback on the missing results and any implementation details. I would send it to peer review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes PhySE, a framework for AR-LLM-based social engineering attacks that targets two bottlenecks: cold-start profiling delays from RAG methods and reliance on static handcrafted strategies. It introduces VLM-Based SocialContext Training to enable rapid on-the-fly profile generation via pre-trained visual-language models and an Adaptive Psychological Agent that selects psychological strategy classes dynamically based on target responses. The approach is evaluated in an IRB-approved user study with 60 participants that collected a dataset of 360 annotated conversations across social scenarios.

Significance. If the VLM pre-training demonstrably eliminates first-turn delays without accuracy or latency penalties and the adaptive agent yields measurable gains in trust-building over static baselines, the work could advance research on emerging AR-LLM threats and provide a reusable annotated dataset for studying real-time social interactions. The explicit grounding in psychological theory for the adaptive component is a positive step beyond purely heuristic tactics.

major comments (3)

[Evaluation section] Evaluation section: The manuscript states that an IRB-approved study collected 360 annotated conversations but reports none of the required quantitative metrics (first-turn profile latency vs. RAG, profile accuracy, conversation success rates, or statistical comparisons to static-strategy baselines). This is load-bearing for the central claim that the two innovations resolve the identified bottlenecks.
[VLM-Based SocialContext Training] VLM-Based SocialContext Training subsection: No details are given on the social-context pre-training dataset, the exact VLM architecture or fine-tuning procedure, or any ablation showing that on-the-fly inference is faster and at least as accurate as RAG without introducing new latency.
[Adaptive Psychological Agent] Adaptive Psychological Agent subsection: The description of response-conditioned strategy selection lacks the concrete mapping from target utterances to psychological strategy classes, the prompting or fine-tuning method used by the LLM, and any reference to specific psychological literature that justifies the chosen classes.

minor comments (2)

[Abstract] The abstract would be strengthened by a single sentence summarizing the key empirical outcomes (even if high-level) rather than stopping at dataset collection.
[Introduction] Notation for the two core components is introduced without a consistent acronym or diagram that distinguishes the VLM pre-training stage from the runtime adaptive agent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We agree that the evaluation and technical details require expansion to fully support our claims regarding the resolution of the identified bottlenecks. We address each major comment below and will incorporate the requested additions in the revised version.

read point-by-point responses

Referee: [Evaluation section] Evaluation section: The manuscript states that an IRB-approved study collected 360 annotated conversations but reports none of the required quantitative metrics (first-turn profile latency vs. RAG, profile accuracy, conversation success rates, or statistical comparisons to static-strategy baselines). This is load-bearing for the central claim that the two innovations resolve the identified bottlenecks.

Authors: The referee is correct that the current manuscript does not report the specific quantitative metrics. We will revise the Evaluation section to include first-turn profile latency comparisons versus RAG, profile accuracy metrics, conversation success rates across the social scenarios, and statistical comparisons (including appropriate tests such as t-tests) to static-strategy baselines. These will be computed directly from the 360 annotated conversations collected in the IRB-approved 60-participant study. revision: yes
Referee: [VLM-Based SocialContext Training] VLM-Based SocialContext Training subsection: No details are given on the social-context pre-training dataset, the exact VLM architecture or fine-tuning procedure, or any ablation showing that on-the-fly inference is faster and at least as accurate as RAG without introducing new latency.

Authors: We will expand the VLM-Based SocialContext Training subsection with the requested details: the composition and size of the social-context pre-training dataset, the precise VLM architecture and any modifications, the fine-tuning procedure (including objectives and hyperparameters), and ablation experiments. The ablations will quantify that on-the-fly inference achieves lower first-turn latency than RAG while preserving or improving profile accuracy and without adding pipeline latency. revision: yes
Referee: [Adaptive Psychological Agent] Adaptive Psychological Agent subsection: The description of response-conditioned strategy selection lacks the concrete mapping from target utterances to psychological strategy classes, the prompting or fine-tuning method used by the LLM, and any reference to specific psychological literature that justifies the chosen classes.

Authors: We will augment the Adaptive Psychological Agent subsection with a concrete mapping (via decision rules or pseudocode) from target utterances to the psychological strategy classes, a description of the LLM prompting or fine-tuning approach used for selection, and citations to specific psychological literature (e.g., Cialdini’s principles of persuasion and related social influence research) that ground the chosen classes. revision: yes

Circularity Check

0 steps flagged

No circularity detected; proposals are independent of inputs

full rationale

The paper identifies cold-start delays and static strategies as bottlenecks, then proposes VLM pre-training for on-the-fly profiles and a response-conditioned psychological agent as solutions. No equations, fitted parameters, or derivations are presented that reduce the claimed innovations back to the inputs by construction. The IRB study is described only as data collection; no predictive claims or self-referential definitions appear. The framework remains a set of forward proposals grounded in stated limitations rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on domain assumptions about the effectiveness of pre-trained VLMs for instant social profiling and the superiority of adaptive psychological tactics; no free parameters or invented entities with independent evidence are explicitly quantified.

axioms (2)

domain assumption Pre-training a VLM with social-context data enables rapid, on-the-fly profile generation that eliminates cold-start delays in real-time AR interactions
Directly invoked to solve the first bottleneck.
domain assumption Distinct classes of psychological strategies can be dynamically deployed by an LLM agent based on target responses to outperform static handcrafted tactics
Core premise for the second innovation and claimed improvement.

invented entities (2)

Adaptive Psychological Agent no independent evidence
purpose: Dynamically selects and deploys psychological strategy classes in response to target behavior during live conversation
New component introduced to move beyond static scripts; no independent falsifiable evidence provided beyond the study claim.
VLM-Based SocialContext Training procedure no independent evidence
purpose: Efficient pre-training of visual language model on social-context data for instant profiling
Novel training method claimed to solve personalization delays; no external validation details given.

pith-pipeline@v0.9.0 · 5603 in / 1727 out tokens · 122976 ms · 2026-05-08T08:16:23.084385+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 4 canonical work pages · 1 internal anchor

[1]

A. E. Abele, N. Ellemers, S. T. Fiske, A. Koch, and V. Yzerbyt. Navigating the social world: Toward an integrated framework for evaluating self, individuals, and groups.Psychological review, 128(2):290, 2021

2021
[2]

Afane, W

K. Afane, W. Wei, Y. Mao, J. Farooq, and J. Chen. Next-generation phishing: How llm agents empower cyber attackers. In2024 IEEE International Conference on Big Data (BigData), pages 2558–2567. IEEE, 2024

2024
[3]

T. Bi, C. Ye, Z. Yang, Z. Zhou, C. Tang, Z. Tao, J. Zhang, K. Wang, L. Zhou, Y. Yang, and T. Yu. On the feasibility of using multimodal LLMs to execute AR social engineering attacks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38252–38260, 2026

2026
[4]

Bilge, T

L. Bilge, T. Strufe, D. Balzarotti, and E. Kirda. All your contacts are belong to us: automated identity theft attacks on social networks. InProceedings of the 18th international conference on World wide web, pages 551–560, 2009

2009
[5]

Burda, L

P. Burda, L. Allodi, and N. Zannone. Cognition in social engineering empirical research: a systematic literature review.ACM Transactions on Computer-Human Interaction, 31(2):1–55, 2024

2024
[6]

S. Chen, Z. Li, F. Dangelo, C. Gao, and X. Fu. A case study of security and privacy threats from augmented reality (ar). In2018 international conference on computing, networking and communications (ICNC), pages 442–446. IEEE, 2018

2018
[7]

Z. Chen, Z. Zhao, W. Qu, Z. Wen, Z. Han, Z. Zhu, J. Zhang, and H. Yao. Pandora: Detailed llm jailbreaking via collaborated phishing agents with decomposed reasoning. InICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024

2024
[8]

L. Choo. How 2 students used the meta ray-bans to access personal informa- tion. https://www.forbes.com/sites/lindseychoo/2024/10/04/meta-ray-bans-ai- privacy-surveillance/, 2025

2024
[9]

P. V. Falade. Decoding the threat landscape: Chatgpt, fraudgpt, and wormgpt in social engineering attacks.arXiv preprint arXiv:2310.05595, 2023

work page arXiv 2023
[10]

E. J. Finkel, P. W. Eastwick, B. R. Karney, H. T. Reis, and S. Sprecher. Online dating: A critical analysis from the perspective of psychological science.Psychological Science in the Public interest, 13(1):3–66, 2012

2012
[11]

Fuste and C

A. Fuste and C. Schmandt. Artextiles: Promoting social interactions around personal interests through augmented reality. InProceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pages 470–470, 2017

2017
[12]

S. Granger. Social engineering fundamentals, part i: hacker tactics.Security Focus, December, 18, 2001

2001
[13]

Harmon-Jones and J

E. Harmon-Jones and J. Mills. An introduction to cognitive dissonance theory and an overview of current perspectives on the theory. 2019

2019
[14]

Hirskyj-Douglas, A

I. Hirskyj-Douglas, A. Kantosalo, A. Monroy-Hernández, J. Zimmermann, M. Nebeling, and M. Gonzalez-Franco. Social ar: Reimagining and interrogating the role of augmented reality in face to face social interactions. InCompanion Publication of the 2020 Conference on Computer Supported Cooperative Work and Social Computing, pages 457–465, 2020

2020
[15]

G. Ho, A. Cidon, L. Gavish, M. Schweighauser, V. Paxson, S. Savage, G. M. Voelker, and D. Wagner. Detecting and characterizing lateral phishing at scale. In28th USENIX security symposium (USENIX security 19), pages 1273–1290, 2019

2019
[16]

M. Z. Iqbal and A. G. Campbell. Adopting smart glasses responsibly: potential benefits, ethical, and privacy concerns with ray-ban stories.AI and Ethics, 3(1):325–327, 2023

2023
[17]

Jansen and F

P. Jansen and F. Fischbach. The social engineer: An immersive virtual reality educational game to raise social engineering awareness. InExtended Abstracts of the 2020 Annual Symposium on Computer-Human Interaction in Play, pages 59–63, 2020

2020
[18]

Kawamichi, K

H. Kawamichi, K. Yoshihara, A. T. Sasaki, S. K. Sugawara, H. C. Tanabe, R. Shino- hara, Y. Sugisawa, K. Tokutake, Y. Mochizuki, T. Anme, et al. Perceiving active listening activates the reward system and improves the impression of relevant experiences.Social neuroscience, 10(1):16–26, 2015

2015
[19]

Krombholz, H

K. Krombholz, H. Hobel, M. Huber, and E. Weippl. Advanced social engineering attacks.Journal of Information Security and applications, 22:113–122, 2015

2015
[20]

J.-S. Lee, S. Kim, and S. Pan. The role of relationship marketing investments in customer reciprocity.International Journal of Contemporary Hospitality Manage- ment, 26(8):1200–1224, 2014

2014
[21]

S. M. Lehman, A. S. Alrumayh, K. Kolhe, H. Ling, and C. C. Tan. Hidden in plain sight: Exploring privacy risks of mobile augmented reality applications.ACM Transactions on Privacy and Security, 25(4):1–35, 2022

2022
[22]

C. Li, G. Wu, G. Y.-Y. Chan, D. G. Turakhia, S. C. Quispe, D. Li, L. Welch, C. Silva, and J. Qian. Satori: Towards proactive ar assistant with belief-desire-intention user modeling.arXiv preprint arXiv:2410.16668, 2024

work page arXiv 2024
[23]

A. Ma, J. J. Paek, F. Liu, and J. Y. Kim. Threats to personal control fuel similarity attraction.Proceedings of the National Academy of Sciences, 121(43):e2321189121, 2024

2024
[24]

S. S. Roy, P. Thota, K. V. Naragam, and S. Nilizadeh. From chatbots to phishbots?: Phishing scam generation in commercial large language models. In2024 IEEE Symposium on Security and Privacy (SP), pages 36–54. IEEE, 2024. Tianlong Yu1, Yang Yang1, Ziyi Zhou1, Jiaying Xu1, Siwei Li1, Tong Guan2, Kailong Wang3, Ting Bi3

2024
[25]

D. I. Tamir and J. P. Mitchell. Disclosing information about the self is intrinsically rewarding.Proceedings of the National Academy of Sciences, 109(21):8038–8043, 2012

2012
[26]

Timko, D

D. Timko, D. H. Castillo, and M. L. Rahman. Understanding influences on sms phishing detection: User behavior, demographics, and message attributes. 2025

2025
[27]

Tsai, S.-K

H.-R. Tsai, S.-K. Chiu, and B. Wang. Gazenoter: Co-piloted ar note-taking via gaze selection of llm suggestions to match users’ intentions.arXiv preprint arXiv:2407.01161, 2024

work page arXiv 2024
[28]

Ulqinaku, H

E. Ulqinaku, H. Assal, A. Abdou, S. Chiasson, and S. Capkun. Is real-time phishing eliminated with {FIDO}? social engineering downgrade attacks against {FIDO} protocols. In30th USENIX Security Symposium (USENIX Security 21), pages 3811–3828, 2021

2021
[29]

Vadrevu and R

P. Vadrevu and R. Perdisci. What you see is not what you get: Discovering and tracking social engineering attack campaigns. InProceedings of the Internet Measurement Conference, pages 308–321, 2019

2019
[30]

T. A. Venema, F. M. Kroese, J. S. Benjamins, and D. T. De Ridder. When in doubt, follow the crowd? responsiveness to social proof nudges in the absence of clear preferences.Frontiers in psychology, 11:1385, 2020

2020
[31]

C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023

work page internal anchor Pith review arXiv 2023
[32]

I. Wang, J. Smith, and J. Ruiz. Exploring virtual agents for augmented reality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2019

2019
[33]

B. Yang, Y. Guo, L. Xu, Z. Yan, H. Chen, G. Xing, and X. Jiang. Socialmind: Llm-based proactive ar social assistive system with human-like perception for in-situ live interactions.Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 9(1):1–30, 2025

2025
[34]

Z. Yang, J. Allen, M. Landen, R. Perdisci, and W. Lee. {TRIDENT}: Towards detecting and mitigating web-based social engineering attacks. In32nd USENIX Security Symposium (USENIX Security 23), pages 6701–6718, 2023

2023
[35]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

2023
[36]

T. Yu, C. Ye, Z. Yang, Z. Zhou, C. Tang, Z. Tao, J. Zhang, K. Wang, L. Zhou, Y. Yang, and T. Bi. Sear: A multimodal dataset for analyzing ar-llm-driven social engineering behaviors. InProceedings of the 33rd ACM International Conference on Multimedia, pages 12981–12987, 2025

2025
[37]

Zhang, C

Y. Zhang, C. Slocum, J. Chen, and N. Abu-Ghazaleh. It’s all in your head (set): Side-channel attacks on {AR/VR} systems. In32nd USENIX Security Symposium (USENIX Security 23), pages 3979–3996, 2023

2023