arXiv preprint arXiv:2410.17245
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
citing papers explorer
- Adversarial Robustness of Activation Steering in Large Language Models
  First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
- Detecting and Controlling Sycophancy with Cascading Linear Features
  Cascading linear features extracted from graded sycophancy samples form separable subspaces that enable detection, scoring, and steering of sycophantic behavior in LLMs, matching or exceeding LLM-judge and prompting baselines.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.