Advances in Neural Information Processing Systems

Refusal in language models is mediated by a single direction

7 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.

Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

Geometric Unlearning distills a low-rank safe subspace from reference prompts and applies projection-based alignment on synthetic anchors to suppress target content while preserving non-target utility.

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

cs.CR · 2026-04-20 · unverdicted · novelty 6.0

Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

cs.AI · 2026-05-18 · unverdicted · novelty 5.0

Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

LiSA continually improves AI guardrails by inducing conservative policies from sparse, noisy failure reports using structured memory, conflict-aware rules, and posterior lower-bound gating.

citing papers explorer

Showing 1 of 1 citing paper after filters.

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · none · ref 7
