Shadow alignment: The ease of subverting safely-aligned language models
10 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1). Citation polarities: support (1).
Citing papers explorer
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A survey of 116 papers yields a new 7x4 taxonomy that organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and gaps in defenses.
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual-stream activations that can be erased to disable safety or added to elicit refusal (a directional-ablation sketch follows this list).
- Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
  Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
- Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
  A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to an 81.73% attack success rate on GPT-4.1-nano at low cost (see the DPO loss sketch after this list).
- Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
  Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization, outperforming prior methods on the forget-retain trade-off across LLM benchmarks (an orthogonality penalty is sketched after this list).
- Continual Safety Alignment via Gradient-Based Sample Selection
  Gradient-based selection that drops high-gradient samples during continual fine-tuning preserves safety alignment in LLMs better than standard fine-tuning while keeping task performance competitive (see the selection sketch after this list).
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
  ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
  Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness (a Fisher-weighted penalty is sketched after this list).
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey
  A survey that builds taxonomies of jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
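
A minimal sketch of the directional-ablation idea behind "Refusal in Language Models Is Mediated by a Single Direction": remove (or re-add) the component of residual-stream activations along a single refusal direction. The tensor shapes and unit-norm convention here are assumptions, not the paper's exact implementation.

```python
import torch

def ablate_direction(h: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of activations `h` along `direction`.

    h:         (..., d_model) residual-stream activations
    direction: (d_model,) candidate refusal direction
    """
    d = direction / direction.norm()    # unit-normalize the direction
    coeff = h @ d                       # (...,) projection coefficients
    return h - coeff.unsqueeze(-1) * d  # subtract the projection

def add_direction(h: torch.Tensor, direction: torch.Tensor, alpha: float = 1.0):
    """Adding the direction back, scaled by alpha, can instead elicit refusal."""
    d = direction / direction.norm()
    return h + alpha * d
```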
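The few-shot benign DPO attack optimizes the standard DPO objective; the summary above does not specify how the 10 preference pairs are constructed, so the sketch below shows only the loss itself, over per-sequence log-probabilities that are assumed to be precomputed.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss over per-sequence log-probabilities (tensors of
    shape (batch,)). Optimizing preferences on even a handful of harmless
    pairs can shift refusal behavior as a side effect."""
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```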
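For REGLU, the summary mentions representation subspaces and orthogonal regularization but not their exact form; one plausible reading, sketched here, is a penalty that keeps the low-rank LoRA update out of a subspace whose directions should be retained. The `retain_basis` argument and the squared-Frobenius-norm penalty are assumptions.

```python
import torch

def orthogonality_penalty(lora_A: torch.Tensor, lora_B: torch.Tensor,
                          retain_basis: torch.Tensor) -> torch.Tensor:
    """Penalize overlap between the LoRA weight update (B @ A) and a
    subspace to preserve (retain_basis: (d_out, k), orthonormal columns)."""
    delta_w = lora_B @ lora_A                         # (d_out, d_in) update
    proj = retain_basis @ (retain_basis.T @ delta_w)  # project onto subspace
    return (proj ** 2).sum()                          # squared Frobenius norm
```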
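A hypothetical sketch of gradient-based sample selection for continual safety alignment: score each sample by its gradient norm and drop the highest-scoring fraction before fine-tuning. The `loss_fn` interface and the per-sample loop are illustrative, not the paper's implementation.

```python
import torch

def drop_high_gradient_samples(model, loss_fn, batch, drop_frac: float = 0.1):
    """Return the subset of `batch` with the smallest per-sample gradient
    norms; `loss_fn(model, sample)` is assumed to return a scalar loss."""
    scores = []
    for sample in batch:
        model.zero_grad()
        loss_fn(model, sample).backward()
        sq = sum((p.grad ** 2).sum() for p in model.parameters()
                 if p.grad is not None)
        scores.append(float(sq) ** 0.5)          # gradient norm for this sample
    keep = int(len(batch) * (1 - drop_frac))
    order = sorted(range(len(batch)), key=scores.__getitem__)
    return [batch[i] for i in order[:keep]]      # fine-tune on these samples
```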
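Finally, the Fisher-Weighted Safety Subspace Regularization result can be pictured with an EWC-style penalty: weight parameter drift by a Fisher estimate computed on safety data, so that safety-critical coordinates resist movement during benign fine-tuning. This is a generic sketch of that family of regularizers, not the paper's exact method.

```python
import torch

def fisher_weighted_penalty(model, ref_params, fisher, lam: float = 1.0):
    """EWC-style drift penalty. `ref_params` and `fisher` map parameter
    names to tensors captured before fine-tuning (assumed precomputed on
    safety data); add the result to the task loss."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - ref_params[name]) ** 2).sum()
    return lam * penalty
```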