Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Juli You; Lixing Lin; Luyun Lin; Moxuan Zheng; Yiqing Wang; Yue Li; Zhen Zhang

arxiv: 2605.24834 · v1 · pith:OPEKIKOXnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Lixing Lin , Juli You , Yue Li , Luyun Lin , Yiqing Wang , Zhen Zhang , Moxuan Zheng This is my paper

classification 💻 cs.CR cs.AI

keywords adversarialsafetyclassifiersmodelpromptsreflect-guardintentlogical

0 comments

read the original abstract

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents
cs.LG 2026-06 unverdicted novelty 7.0

High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.
TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation
cs.IR 2026-06 unverdicted novelty 7.0

TRACER uses token reassignment for concept-related items plus a coherence regularizer to unlearn specific concepts in generative recommendation while preserving utility better than baselines.
Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation
cs.LG 2026-06 unverdicted novelty 6.0

GPT-4o and Claude Sonnet 4 show similar susceptibility to bias on GSM8K (1.3% vs 1.2%) but differ sharply in acknowledgment rates (13% vs 75%) under a rubric-defined metric.
Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
cs.CL 2026-06 unverdicted novelty 6.0

First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.