Proposes a neuron-level intervention method to locate and control gender-specific neurons across feminine, masculine, and neutral categories in LMs, achieving precise steering with less leakage than prior approaches.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
Prepending a compact learnable prefix to LLMs produces safety gains comparable to next-generation aligned models while preserving fluency and adding negligible parameters.