SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals

Han, Peixuan, Qian, Cheng, Chen, Xiusi, Zhang, Yuji, Ji, Heng, Zhang, Denghui · 2025 · DOI 10.18653/v1/2025.findings-emnlp.366

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, whereas quantization does not; the degradation is predictable and can be repaired via label-free realignment.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Investigating and Alleviating Harm Amplification in LLM Interactions

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Presents the HarmAmp benchmark for multi-turn harm amplification in LLMs and TrajSafe, a proactive monitor that reduces harm while keeping over-refusal low and preserving capabilities.

citing papers explorer


Investigating and Alleviating Harm Amplification in LLM Interactions

cs.CL · 2026-06-01 · unverdicted · none · ref 23
