Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
A Language Model’s Guide Through Latent Space
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
representative citing papers
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Proposes a fault-tolerance architecture for AI safety by analogizing unreliable AI artifacts to Byzantine nodes and applying consensus mechanisms.