Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

Wanlong Fang; Xiang Fang

arXiv: 2605.27823 · v1 · pith:TUYJMRREnew · submitted 2026-05-27 · 💻 cs.CR · cs.AI · cs.CV

Pith reviewed 2026-06-29 12:06 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CV

keywords adversarial prompts · LLM security · jailbreaking · prompt injection · semantic decomposition · graph-based classification · mutual information · transformer classifier

The pith

The APD framework disentangles adversarial prompts with mutual information and graph analysis to cut harmful LLM outputs by over 85 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Adversarial Prompt Disentanglement framework as a defense that breaks input prompts into separate adversarial and benign parts before an LLM processes them. It does this through mutual information decomposition to create statistically independent components, followed by spectral graph analysis for intent detection and a trained transformer classifier. The goal is to block jailbreaking and prompt injection attacks that exploit semantic ambiguities. A sympathetic reader would care because current LLMs remain open to these bypasses in security-sensitive settings, and the method claims to add protection without slowing normal operation. If the approach holds, it would let LLMs run more safely in applications where harmful outputs carry real costs.

Core claim

The APD framework proactively identifies and neutralizes malicious components in input prompts before they reach the LLM by combining three elements: a mutual information-based semantic decomposition that isolates adversarial and benign parts while ensuring statistical independence, a graph-based intent classification that uses spectral analysis to detect malicious semantic patterns, and a lightweight transformer classifier trained on real-world toxic and jailbreaking prompts. On diverse adversarial datasets the method reduces harmful output generation by over 85 percent while leaving model performance essentially unchanged and supporting real-time use.

What carries the argument

The Adversarial Prompt Disentanglement (APD) framework, which isolates prompt components via mutual information and detects malicious intent via spectral graph analysis before LLM processing.

If this is right

LLMs gain robustness against jailbreaking and prompt injection in security-critical deployments.
The added defense runs efficiently enough for real-time applications without requiring heavy extra hardware.
Normal task performance on clean prompts remains nearly identical to the undefended model.
The approach supplies a scalable, pre-processing layer against prompt-based threats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decomposition step could be tested on multimodal inputs if the mutual-information separation generalizes beyond text.
Layering APD with output-side filters might create a two-stage defense whose combined failure rate is lower than either alone.
Performance on attack variants invented after the training data cutoff would need separate measurement to confirm lasting coverage.
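The layering idea in the list above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes, hypothetically, that the input-side filter and the output-side filter miss attacks independently, and the function names and miss rates are made up for the example rather than taken from the paper. When the dependence between the two stages is unknown, the joint miss rate is only known to lie within the Fréchet bounds, which the second function computes.

```python
def combined_miss_rate(p_input: float, p_output: float) -> float:
    """Probability that an attack slips past both stages, ASSUMING independent misses."""
    for p in (p_input, p_output):
        if not 0.0 <= p <= 1.0:
            raise ValueError("miss rates must lie in [0, 1]")
    return p_input * p_output


def miss_rate_bounds(p_input: float, p_output: float) -> tuple:
    """Fréchet bounds on the joint miss rate when the dependence is unknown."""
    lower = max(0.0, p_input + p_output - 1.0)
    upper = min(p_input, p_output)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical miss rates, chosen only for illustration.
    print(combined_miss_rate(0.15, 0.20))
    print(miss_rate_bounds(0.15, 0.20))
```

The upper bound shows that a two-stage defense never does worse than its better stage, but it is strictly better only if the two stages do not fail on exactly the same prompts. That overlap would itself need to be measured.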

Load-bearing premise

Mutual information decomposition can separate adversarial and benign prompt elements into statistically independent parts, and spectral graph analysis can reliably flag the malicious patterns.
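One way to probe this premise is to estimate the mutual information that remains between the two components after the split. The sketch below is a plain plug-in estimator for discrete variables. It is not the paper's method: the paper does not say which estimator or representation it uses, and the assumption that each component has been reduced to a discrete label (for example a cluster ID of an embedding) is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical labels: "adversarial" part A and "benign" part B per prompt.
    a = [0, 0, 1, 1]
    print(mutual_information_bits(a, [0, 0, 1, 1]))  # fully dependent labels
    print(mutual_information_bits(a, [0, 1, 0, 1]))  # independent labels
```

The plug-in estimator is biased upward on small samples, so a small positive value does not by itself show dependence. A comparison against shuffled pairings would be a reasonable sanity check.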

What would settle it

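Apply APD to a fresh collection of adversarial prompts outside the training and test sets and measure the harmful-output rate relative to the undefended model. If that rate stays above 15 percent of the undefended baseline, the claimed reduction of over 85 percent does not generalize.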
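The check described here reduces to comparing two rates. The sketch below reads "reducing harmful output generation by over 85 percent" as a relative reduction against the undefended model, which is an interpretation rather than something the paper spells out. The function names, the boolean harm flags, and the example rates are all hypothetical.

```python
def harmful_output_rate(flags):
    """Fraction of responses judged harmful; `flags` holds one boolean per response."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one response")
    return sum(flags) / len(flags)


def relative_reduction(baseline_rate: float, defended_rate: float) -> float:
    """Relative drop in the harmful-output rate versus the undefended baseline."""
    if baseline_rate <= 0.0:
        raise ValueError("baseline rate must be positive")
    return 1.0 - defended_rate / baseline_rate


def claim_survives(baseline_rate: float, defended_rate: float, threshold: float = 0.85) -> bool:
    """True if the observed relative reduction exceeds the claimed threshold."""
    return relative_reduction(baseline_rate, defended_rate) > threshold


if __name__ == "__main__":
    # Hypothetical rates on a held-out attack set, not results from the paper.
    print(claim_survives(baseline_rate=0.60, defended_rate=0.06))
    print(claim_survives(baseline_rate=0.60, defended_rate=0.12))
```

A real test would also report a confidence interval on both rates, since a held-out attack set is usually small.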

Figures

Figures reproduced from arXiv: 2605.27823 by Wanlong Fang, Xiang Fang.

**Figure 1.** The end-to-end pipeline of the Adversarial Prompt Disentanglement (APD) framework. An input prompt is first en…

**Figure 2.** Performance comparison across datasets for APD…

**Figure 3.** Comparison of HOR, showing APD’s superior…

**Figure 4.** 2D scatter plot of latent representations, show…

Original abstract

Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85% while maintaining negligible impact on model performance. The framework's computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security on novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches an APD defense using MI decomposition plus graphs and a transformer but supplies zero experiments or numbers to support the 85% claim.

The core pitch is a three-part APD setup: mutual-information decomposition to split adversarial from benign prompt parts (claimed to produce statistical independence), spectral graph analysis for intent detection, and a small transformer classifier. It targets jailbreaks and prompt injection in LLMs and asserts the whole thing cuts harmful outputs by more than 85% with almost no accuracy drop.

Nothing in the abstract or description shows new primitives; it stitches together existing MI estimators, graph spectral methods, and fine-tuned classifiers and applies them to prompt security. That combination in this setting is the only incremental element.

The obvious gap is the complete absence of results. No datasets, no baselines, no ablation on the decomposition step, no measured residual mutual information after splitting, and no error bars. The independence claim is asserted but not checked, and the stress-test note is right that discrete, semantically entangled prompts make clean separation unlikely with standard MI tools. If dependence remains, the downstream graph and classifier cannot fix upstream leakage, which undercuts the performance number.

The work is aimed at applied LLM security researchers who already follow prompt-attack papers. A reader looking for a finished, reproducible defense will find little to use or cite. The central assumption is untested, so the paper does not yet merit referee time.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Adversarial Prompt Disentanglement (APD) framework to defend LLMs against adversarial prompts such as jailbreaking and prompt injection. It integrates three components: (1) a mutual information-based semantic decomposition method claimed to isolate adversarial and benign prompt components while ensuring statistical independence, (2) a graph-based intent classification approach using spectral analysis to detect malicious patterns, and (3) a lightweight transformer-based classifier trained on toxic and jailbreaking prompts. The paper asserts that APD reduces harmful output generation by over 85% with negligible impact on model performance and supports real-time deployment.

Significance. If the performance claims hold, the APD framework would represent a meaningful advance in LLM security by providing a proactive, multi-stage defense that addresses semantic ambiguities in adversarial inputs. The combination of information-theoretic decomposition with graph-based and transformer methods could inform future work on prompt-level robustness in generative models deployed in security-critical settings.

major comments (2)

[Abstract] The assertion that the mutual information-based semantic decomposition 'ensures statistical independence' between adversarial and benign components is presented without any supporting quantitative evidence, such as post-decomposition MI estimates, ablation results on residual dependence, or a formal argument that the decomposition operator enforces I(A;B)≈0. This independence is load-bearing for the downstream claim of >85% reduction in harmful outputs, as residual dependence would allow malicious intent to reach the LLM.
[Abstract] The evaluation claims 'superior robustness' and a reduction in harmful output generation 'by over 85%' with 'negligible impact on model performance,' yet the manuscript supplies no datasets, baselines, metrics, error bars, ablation studies, or result tables to substantiate these figures. Without this evidence the central empirical claim cannot be assessed.

minor comments (1)

[Abstract] The description of the three innovations would be clearer if each were tied to a specific section or figure in the main text rather than listed only in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to provide the requested supporting evidence.

Referee: [Abstract] The assertion that the mutual information-based semantic decomposition 'ensures statistical independence' between adversarial and benign components is presented without any supporting quantitative evidence, such as post-decomposition MI estimates, ablation results on residual dependence, or a formal argument that the decomposition operator enforces I(A;B)≈0. This independence is load-bearing for the downstream claim of >85% reduction in harmful outputs, as residual dependence would allow malicious intent to reach the LLM.

Authors: We agree that the abstract does not currently include quantitative evidence for the independence claim. In the revised manuscript, we will add post-decomposition mutual information estimates, ablation results on residual dependence, and a formal argument or derivation demonstrating how the decomposition operator enforces I(A;B)≈0. These additions will directly support the downstream performance claims.
revision: yes

Referee: [Abstract] The evaluation claims 'superior robustness' and a reduction in harmful output generation 'by over 85%' with 'negligible impact on model performance,' yet the manuscript supplies no datasets, baselines, metrics, error bars, ablation studies, or result tables to substantiate these figures. Without this evidence the central empirical claim cannot be assessed.

Authors: We acknowledge that the manuscript as submitted lacks the detailed empirical substantiation referenced in the abstract. The revised version will include explicit descriptions of the datasets, baselines, metrics, error bars, ablation studies, and result tables to fully substantiate the reported reductions and performance impacts.
revision: yes

Circularity Check

0 steps flagged

No circularity: method assertions are not derived from self-referential inputs

The paper presents APD as a framework with three components, including a mutual information-based decomposition asserted to ensure statistical independence. No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction or result back to the input data or assumptions by construction. The performance claim (>85% reduction) is presented as an empirical outcome on datasets rather than a mathematical consequence of the method definition. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The description is self-contained as an engineering proposal without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper introduces a new framework relying on assumptions about decomposition and graph analysis without providing independent evidence or parameters.

axioms (2)

domain assumption: Mutual information decomposition can separate adversarial and benign components with statistical independence.
Central to the first innovation described.
domain assumption: Spectral analysis on semantic graphs can detect malicious intent patterns.
Basis for the graph-based classification.
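To give a sense of what "spectral analysis on a semantic graph" can mean in practice, the sketch below computes one common spectral quantity, the algebraic connectivity (second-smallest Laplacian eigenvalue) of a weighted graph. This is an illustration under stated assumptions, not the paper's procedure: the paper does not specify how the graph is built or which spectral statistic it uses. Here the nodes are imagined to be prompt segments and the weights hypothetical similarity scores. The eigenvalue is found with a dependency-free shifted power iteration; in practice a routine such as `numpy.linalg.eigvalsh` would be the usual choice.

```python
import math


def graph_laplacian(weights):
    """Combinatorial Laplacian L = D - W of a symmetric weight matrix (self-loops ignored)."""
    n = len(weights)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                lap[i][j] = -weights[i][j]
                lap[i][i] += weights[i][j]
    return lap


def algebraic_connectivity(weights, iterations=500):
    """Second-smallest Laplacian eigenvalue via power iteration on c*I - L.

    Values near 0 mean the graph nearly splits into two weakly linked groups.
    """
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two nodes")
    lap = graph_laplacian(weights)
    # Eigenvalues of L are at most 2 * max degree, so c*I - L is positive definite.
    c = 2.0 * max(lap[i][i] for i in range(n)) + 1.0
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start vector
    rayleigh = c
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector of L
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        w = [c * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        rayleigh = sum(v[i] * w[i] for i in range(n))  # Rayleigh quotient of c*I - L
        v = w
    return c - rayleigh


if __name__ == "__main__":
    triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                        # tightly connected
    two_pairs = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # disconnected
    print(algebraic_connectivity(triangle))
    print(algebraic_connectivity(two_pairs))
```

One limitation of this sketch is that the fixed start vector could, for an unlucky graph, be orthogonal to the relevant eigenvector, in which case the iteration would return a larger eigenvalue. Whether any such statistic actually separates malicious from benign prompts is the untested assumption listed above.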

invented entities (1)

Adversarial Prompt Disentanglement (APD) framework · no independent evidence
purpose: To proactively identify and neutralize malicious prompt components.
Newly introduced defense mechanism.

pith-pipeline@v0.9.1-grok · 5764 in / 1095 out tokens · 48118 ms · 2026-06-29T12:06:08.559211+00:00 · methodology
