Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

· 2026 · cs.CR · arXiv 2605.24834

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

open full Pith review browse 8 citing papers arXiv PDF

abstract

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.

representative citing papers

TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation

cs.IR · 2026-06-05 · unverdicted · novelty 7.0

TRACER uses token reassignment for concept-related items plus a coherence regularizer to unlearn specific concepts in generative recommendation while preserving utility better than baselines.

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

FORCEBENCH shows model judges often violate expected ordering on evidence-calibrated vs force-raised claim pairs, with standard support prompting yielding 47.2% MVR and explicit warrant prompting reducing it to 24.5%.

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

cs.LG · 2026-06-13 · unverdicted · novelty 6.0

GPT-4o and Claude Sonnet 4 show similar susceptibility to bias on GSM8K (1.3% vs 1.2%) but differ sharply in acknowledgment rates (13% vs 75%) under a rubric-defined metric.

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

cs.SE · 2026-05-30 · unverdicted · novelty 6.0

About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

cs.LG · 2026-06-27 · unverdicted · novelty 5.0

MOSAIC combines frozen-LLM semantic embeddings with hierarchical consistency objectives to report up to 3.4% AUC gains on knowledge-tracing benchmarks including a new MOOC dataset.

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

cs.CL · 2026-06-27 · unverdicted · novelty 5.0

DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

cs.CV · 2026-04-29

citing papers explorer

Showing 2 of 2 citing papers after filters.

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite cs.CL · 2026-06-09 · unverdicted · none · ref 51 · internal anchor
First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.
DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation cs.CL · 2026-06-27 · unverdicted · none · ref 3 · internal anchor
DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

fields

years

verdicts

representative citing papers

citing papers explorer