Fine-tuning updates frequently render activation monitors for language model safety stale, whereas quantization does not; the degradation is predictable and can be repaired via label-free realignment.
SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Presents HarmAmp benchmark for multi-turn harm amplification in LLMs and TrajSafe proactive monitor that reduces harm while keeping low over-refusal and preserving capabilities.
citing papers explorer
Investigating and Alleviating Harm Amplification in LLM Interactions