SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
Advances in Neural Information Processing Systems
7 Pith papers cite this work.
years: 2026 (7)
representative citing papers
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
Geometric Unlearning distills a low-rank safe subspace from reference prompts and applies projection-based alignment on synthetic anchors to suppress target content while preserving non-target utility.
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
LiSA continually improves AI guardrails over their lifetime by inducing conservative policies from sparse, noisy failure reports using structured memory, conflict-aware rules, and posterior lower-bound gating.
citing papers explorer
SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models