A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
Internal safety collapse in frontier large language models
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 2polarities
background 2representative citing papers
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
SMT achieves the highest attack success rate and HarmScore on commercial function-calling LLMs from five providers by using simulated moderation traces in multi-turn trajectories, outperforming baselines with near-minimal queries.
Spore extracts private data from LLM memory with one query in black-box mode or ranked tokens in gray-box, outperforming prior attacks while bypassing defenses.
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.
citing papers explorer
-
Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Beyond the Prompt: Jailbreaking Function-Calling LLMs via Simulated Moderation Traces
SMT achieves the highest attack success rate and HarmScore on commercial function-calling LLMs from five providers by using simulated moderation traces in multi-turn trajectories, outperforming baselines with near-minimal queries.
-
Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing
Spore extracts private data from LLM memory with one query in black-box mode or ranked tokens in gray-box, outperforming prior attacks while bypassing defenses.
-
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
-
BraveGuard: From Open-World Threats to Safer Computer-Use Agents
BraveGuard trains guard models on realistic agent trajectories derived from open-world threats, raising detection accuracy on AgentHazard from 38.79% to 82.38%.