Shadow alignment: The ease of subverting safely-aligned language models
10 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1). Citation polarities: support (1).
Citing papers explorer
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A survey of 116 papers yields a new 7x4 taxonomy that organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and gaps in defenses.
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual-stream activations that can be erased to disable safety or added to elicit refusal (a directional-ablation sketch follows this list).
- Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
  Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
- Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
  A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to an 81.73% attack success rate on GPT-4.1-nano at low cost (see the DPO loss sketch after this list).
- Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
  Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization, outperforming prior methods on the forget-retain trade-off across LLM benchmarks (an orthogonality penalty is sketched after this list).
- Continual Safety Alignment via Gradient-Based Sample Selection
  Gradient-based selection that drops high-gradient samples during continual fine-tuning preserves safety alignment in LLMs better than standard fine-tuning while keeping task performance competitive (see the selection sketch after this list).
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
  ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
  Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness (a Fisher-weighted penalty is sketched after this list).
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that builds taxonomies of jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
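
A minimal sketch of the directional-ablation idea behind "Refusal in Language Models Is Mediated by a Single Direction": remove (or re-add) the component of residual-stream activations along a single refusal direction. The tensor shapes and unit-norm convention here are assumptions, not the paper's exact implementation.

```python
import torch

def ablate_direction(h: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of activations `h` along `direction`.

    h:         (..., d_model) residual-stream activations
    direction: (d_model,) candidate refusal direction
    """
    d = direction / direction.norm()    # unit-normalize the direction
    coeff = h @ d                       # (...,) projection coefficients
    return h - coeff.unsqueeze(-1) * d  # subtract the projection

def add_direction(h: torch.Tensor, direction: torch.Tensor, alpha: float = 1.0):
    """Adding the direction back, scaled by alpha, can instead elicit refusal."""
    d = direction / direction.norm()
    return h + alpha * d
```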
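The few-shot benign DPO attack optimizes the standard DPO objective; the summary above does not specify how the 10 preference pairs are constructed, so the sketch below shows only the loss itself, over per-sequence log-probabilities that are assumed to be precomputed.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss over per-sequence log-probabilities (tensors of
    shape (batch,)). Optimizing preferences on even a handful of harmless
    pairs can shift refusal behavior as a side effect."""
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```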
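For REGLU, the summary mentions representation subspaces and orthogonal regularization but not their exact form; one plausible reading, sketched here, is a penalty that keeps the low-rank LoRA update out of a subspace whose directions should be retained. The `retain_basis` argument and the squared-Frobenius-norm penalty are assumptions.

```python
import torch

def orthogonality_penalty(lora_A: torch.Tensor, lora_B: torch.Tensor,
                          retain_basis: torch.Tensor) -> torch.Tensor:
    """Penalize overlap between the LoRA weight update (B @ A) and a
    subspace to preserve (retain_basis: (d_out, k), orthonormal columns)."""
    delta_w = lora_B @ lora_A                         # (d_out, d_in) update
    proj = retain_basis @ (retain_basis.T @ delta_w)  # project onto subspace
    return (proj ** 2).sum()                          # squared Frobenius norm
```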
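A hypothetical sketch of gradient-based sample selection for continual safety alignment: score each sample by its gradient norm and drop the highest-scoring fraction before fine-tuning. The `loss_fn` interface and the per-sample loop are illustrative, not the paper's implementation.

```python
import torch

def drop_high_gradient_samples(model, loss_fn, batch, drop_frac: float = 0.1):
    """Return the subset of `batch` with the smallest per-sample gradient
    norms; `loss_fn(model, sample)` is assumed to return a scalar loss."""
    scores = []
    for sample in batch:
        model.zero_grad()
        loss_fn(model, sample).backward()
        sq = sum((p.grad ** 2).sum() for p in model.parameters()
                 if p.grad is not None)
        scores.append(float(sq) ** 0.5)          # gradient norm for this sample
    keep = int(len(batch) * (1 - drop_frac))
    order = sorted(range(len(batch)), key=scores.__getitem__)
    return [batch[i] for i in order[:keep]]      # fine-tune on these samples
```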
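Finally, the Fisher-Weighted Safety Subspace Regularization result can be pictured with an EWC-style penalty: weight parameter drift by a Fisher estimate computed on safety data, so that safety-critical coordinates resist movement during benign fine-tuning. This is a generic sketch of that family of regularizers, not the paper's exact method.

```python
import torch

def fisher_weighted_penalty(model, ref_params, fisher, lam: float = 1.0):
    """EWC-style drift penalty. `ref_params` and `fisher` map parameter
    names to tensors captured before fine-tuning (assumed precomputed on
    safety data); add the result to the task loss."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return lam * penalty
```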