Activation Steering with a Feedback Controller
Pith reviewed 2026-05-21 21:44 UTC · model grok-4.3
The pith
Activation steering in LLMs corresponds to proportional control, and extending it to full PID yields interpretable error dynamics with stability guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. PID Steering applies the full PID controller to activation steering in LLMs, yielding interpretable error dynamics and connecting to classical stability guarantees.
What carries the argument
The PID controller for activation steering, where the proportional term aligns activations with target semantic directions, the integral term accumulates errors to enforce persistent corrections across layers, and the derivative term mitigates overshoot by counteracting rapid activation changes.
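The three terms can be written as a small stateful update. The following is a minimal Python sketch, assuming the discrete law quoted from the paper elsewhere on this page, u(k) = K_p r(k) + K_i sum_{j<k} r(j) + K_d (r(k) - r(k-1)), with k indexing layers. The class name PIDSteering, the method name step, and the choice to zero the derivative term on the first call are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np


class PIDSteering:
    """Hypothetical helper: discrete PID law applied to a feedback vector r(k)."""

    def __init__(self, kp: float, ki: float = 0.0, kd: float = 0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = None  # running sum r(0) + ... + r(k-1)
        self.prev = None      # r(k-1)

    def step(self, r) -> np.ndarray:
        r = np.asarray(r, dtype=float)
        if self.integral is None:
            self.integral = np.zeros_like(r)
        # Assumption: on the first call r(k-1) is taken equal to r(k),
        # so the derivative term contributes nothing at k = 0.
        prev = r if self.prev is None else self.prev
        u = self.kp * r + self.ki * self.integral + self.kd * (r - prev)
        # Update state after computing u, so the sum runs over j < k only.
        self.integral = self.integral + r
        self.prev = r
        return u
```

With ki = kd = 0 the update reduces to u(k) = K_p r(k), which is the proportional case the paper identifies with existing steering methods.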
If this is right
- Steering methods acquire theoretical performance guarantees drawn from control theory.
- Error dynamics become explicit, making it possible to diagnose issues such as persistent offset or overshoot.
- The modular design allows PID terms to combine directly with existing steering vectors and methods.
- Behavioral control gains robustness across different layers and model families.
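The persistent-offset diagnosis can be illustrated on a scalar toy version of the error dynamics. The sketch below assumes a constant gain a and constant disturbance w in e(k+1) = a e(k) - a u(k) + w, and assumes the feedback signal is the error itself. The function name simulate_error and all numeric values are hypothetical; real transformer layers are neither scalar nor time-invariant.

```python
def simulate_error(a, w, kp, ki=0.0, steps=200, e0=1.0):
    """Iterate e(k+1) = a*e(k) - a*u(k) + w under P or PI feedback.

    u(k) = kp*e(k) + ki*s(k), where s(k) is the sum of e(j) for j < k.
    Returns the list of errors e(0), ..., e(steps).
    """
    e, s = e0, 0.0
    trajectory = [e]
    for _ in range(steps):
        u = kp * e + ki * s   # control uses the sum over past errors only
        s += e                # s(k+1) = s(k) + e(k)
        e = a * e - a * u + w
        trajectory.append(e)
    return trajectory


# P-only feedback leaves a nonzero steady-state error when w != 0.
p_only_final = simulate_error(a=0.9, w=0.1, kp=0.5)[-1]
# Adding an integral term removes the offset in this toy model.
pi_final = simulate_error(a=0.9, w=0.1, kp=0.5, ki=0.2)[-1]
```

In this toy setting the P-only error settles at w / (1 - a(1 - kp)), about 0.18 for the values above, while the PI error decays to zero.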
Where Pith is reading between the lines
- The same mapping could let researchers test other classical controllers, such as lead-lag or state-feedback designs, on the same activation signals.
- Checking whether measured activation responses match the closed-loop poles predicted by the PID model would provide a concrete test of the linear approximation.
- Because the controller is lightweight, it could be applied selectively to specific attention heads or layers to isolate their contribution to a target behavior.
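The pole check suggested above can be sketched for a scalar plant. Assuming the toy dynamics e(k+1) = a e(k) - a u(k) + w with PI feedback u(k) = kp e(k) + ki s(k) and s(k+1) = s(k) + e(k), the closed loop is a 2x2 linear system in (e, s) whose eigenvalues are the poles. The function names are hypothetical, and the scalar constant-gain plant is an assumption made only for illustration.

```python
import numpy as np


def pi_closed_loop_poles(a: float, kp: float, ki: float) -> np.ndarray:
    """Eigenvalues of the (e, s) update matrix under scalar PI feedback."""
    M = np.array([
        [a * (1.0 - kp), -a * ki],  # e(k+1) = a(1-kp) e(k) - a ki s(k) (+ w)
        [1.0, 1.0],                 # s(k+1) = e(k) + s(k)
    ])
    return np.linalg.eigvals(M)


def is_stable(poles: np.ndarray) -> bool:
    """Discrete-time stability: all poles strictly inside the unit circle."""
    return bool(np.max(np.abs(poles)) < 1.0)


poles = pi_closed_loop_poles(a=0.9, kp=0.5, ki=0.2)
```

Comparing the decay rate and oscillation of measured error trajectories against the magnitude and angle of such poles is one way the linear approximation could be tested.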
Load-bearing premise
The nonlinear, discrete, and high-dimensional activation dynamics inside transformer layers can be usefully approximated by the linear time-invariant plant model assumed in classical PID control.
What would settle it
Directly measuring activation trajectories during PID steering and finding that they deviate substantially from the error accumulation and damping predicted by the linear model would undermine the transfer of stability guarantees.
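One way to quantify such a deviation is to fit the linear model to a measured trajectory and inspect the residual. The sketch below assumes scalar error and control sequences and the form e(k+1) = a (e(k) - u(k)) + w with constant a and w. The function name linear_fit_residual is hypothetical, and nothing here is taken from the paper's experiments.

```python
import numpy as np


def linear_fit_residual(e, u):
    """Least-squares fit of e(k+1) = a*(e(k) - u(k)) + w.

    e has length T+1 and u has length T. Returns ((a, w), relative residual),
    where the residual is ||y - prediction|| / ||y||.
    """
    e = np.asarray(e, dtype=float)
    u = np.asarray(u, dtype=float)
    X = np.column_stack([e[:-1] - u, np.ones(len(u))])
    y = e[1:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = np.linalg.norm(y - X @ coef) / np.linalg.norm(y)
    return (float(coef[0]), float(coef[1])), float(residual)
```

A relative residual near zero is consistent with the linear model on that trajectory; a large residual would indicate the kind of deviation described above.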
Original abstract
Controlling the behaviors of large language models (LLM) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control. The code is publicly available at: https://github.com/dungnvnus/pid-steering
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a control-theoretic foundation for activation steering in LLMs by showing that popular methods correspond to proportional (P) controllers with the steering vector as feedback signal. It proposes PID Steering, which adds integral and derivative terms to enforce persistent corrections and mitigate overshoot, yielding interpretable error dynamics and connections to classical stability guarantees. The approach is lightweight and integrates with existing methods; experiments across LLM families and benchmarks claim consistent outperformance.
Significance. If the LTI plant approximation for transformer activations is valid to within bounded residuals, the work provides a principled bridge between activation steering and control theory, enabling design of steering methods with potential stability margins and error-trajectory interpretability. Public code release and modular design are strengths that support reproducibility. The empirical gains, if robust, would be practically useful for reliable behavioral control, though the significance depends on validating the modeling assumptions against nonlinear discrete dynamics.
major comments (2)
- [Abstract and modeling sections (around the PID controller derivation)] The central claim that PID Steering connects to classical stability guarantees requires the layer activations to be modeled as the output of a linear time-invariant plant whose state evolves according to standard PID error dynamics. No section derives approximation error bounds or empirically validates this for the nonlinear (ReLU/GELU, attention softmax), discrete (layer index and token space), and high-dimensional transformer forward passes; without such validation the transfer of stability margins does not follow.
- [Introduction and § on P-controller equivalence] The correspondence between existing steering methods and P controllers is presented as definitional once the steering vector is treated as feedback, but the manuscript does not show that this mapping preserves the closed-loop properties under the true nonlinear dynamics; this makes the extension to I and D terms rest primarily on empirical demonstration rather than the same equations.
minor comments (2)
- [PID Steering framework] Clarify the exact definition of the error signal and how the integral term is accumulated across layers without wind-up issues in the discrete setting.
- [Experiments] Add details on baseline implementations and statistical significance testing for the reported outperformance to strengthen the experimental claims.
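For context on the wind-up comment, one common anti-windup technique is to clamp the accumulated integral. The sketch below shows norm clamping of a running sum of error vectors. It is a generic control-engineering device, not something the paper is stated to use, and the function name accumulate_clamped and the limit parameter are hypothetical.

```python
import numpy as np


def accumulate_clamped(integral: np.ndarray, r: np.ndarray, limit: float) -> np.ndarray:
    """Add r to the running integral, then rescale so its L2 norm is at most limit."""
    updated = integral + r
    norm = np.linalg.norm(updated)
    if norm > limit:
        # Keep the direction, cap the magnitude.
        updated = updated * (limit / norm)
    return updated
```

Clamping bounds the integral contribution K_i times the sum regardless of how many layers the error persists across, at the cost of no longer guaranteeing zero steady-state error once the clamp is active.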
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions that will clarify the scope of our modeling assumptions while preserving the core contributions of the work.
- Referee: [Abstract and modeling sections (around the PID controller derivation)] The central claim that PID Steering connects to classical stability guarantees requires the layer activations to be modeled as the output of a linear time-invariant plant whose state evolves according to standard PID error dynamics. No section derives approximation error bounds or empirically validates this for the nonlinear (ReLU/GELU, attention softmax), discrete (layer index and token space), and high-dimensional transformer forward passes; without such validation the transfer of stability margins does not follow.
Authors: We agree that a strict transfer of classical stability margins would require explicit approximation error bounds between the LTI model and the true nonlinear, discrete transformer dynamics. Deriving such bounds analytically is beyond the current scope. In the revision we will add a dedicated subsection that states the LTI approximation explicitly, discusses its limitations with respect to ReLU/GELU nonlinearities and attention softmax, and supplies additional empirical plots of observed error trajectories under PID steering. This will reframe the stability connection as a principled design heuristic rather than a direct guarantee.
Revision: partial
- Referee: [Introduction and § on P-controller equivalence] The correspondence between existing steering methods and P controllers is presented as definitional once the steering vector is treated as feedback, but the manuscript does not show that this mapping preserves the closed-loop properties under the true nonlinear dynamics; this makes the extension to I and D terms rest primarily on empirical demonstration rather than the same equations.
Authors: The P-controller equivalence is introduced by interpreting the steering vector as a feedback signal in activation space. We do not assert that all closed-loop properties are preserved under the actual nonlinear dynamics. The integral and derivative terms are added to mitigate empirically observed shortcomings of proportional-only steering (persistent offset and overshoot). We will revise the introduction and the relevant section to distinguish the definitional mapping from the heuristic PID extension and to emphasize that the primary evidence for the full framework remains the experimental results across model families.
Revision: partial
Circularity Check
Steering-to-P correspondence is definitional once steering vector is identified as feedback; PID adds empirical terms
specific steps
- self-definitional [Abstract]
"we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal."
Once the steering vector is stipulated to be the feedback signal, any method that adds a vector proportional to that signal is a P-controller by definition. The claimed 'showing' therefore reduces to the identification itself rather than a derived equivalence from the transformer equations.
full rationale
The paper's claimed control-theoretic foundation reduces to a re-labeling: existing steering vectors are declared to be the feedback signal, after which the methods are P-controllers by the definition of proportional control. This step is load-bearing for the 'foundation' but contains no independent derivation. The subsequent PID extension introduces I and D terms whose benefit is shown via experiments on real LLMs rather than forced by the same equations, supplying moderate independent content. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the load-bearing chain. The LTI plant modeling is an unvalidated ansatz but does not make the reported results tautological by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM activation dynamics can be approximated sufficiently well by a linear feedback control model for the purpose of applying PID corrections and invoking classical stability results.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers... propose Proportional-Integral-Derivative (PID) Steering... $u(k) = K_p\, r(k) + K_i \sum_{j=0}^{k-1} r(j) + K_d \big( r(k) - r(k-1) \big)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control: Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- When control meets large language models: From words to dynamics: The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.
Reference graph
Works this paper leans on
- [29] $D(t) = K_d \, \frac{dr(t)}{dt}$. Approximating the derivative by the backward Euler difference with $h = 1$ gives
$$D(k) = K_d \big( r(k) - r(k-1) \big). \tag{23}$$
Combining equation 21, equation 22, and equation 23 yields
$$u(k) = K_p\, r(k) + K_i \sum_{j=0}^{k-1} r(j) + K_d \big( r(k) - r(k-1) \big). \qquad \square$$
B.2 Background on Input-to-State Stability & Notations
Background on Input-to-state Stability (ISS). In our proofs,...
- [30] Proposition 2 (Error dynamics of activation steering). The error dynamics $\bar{e}(k)$ in activation steering is of the form
$$\bar{e}(k+1) = \bar{A}(k)\,\bar{e}(k) - \bar{A}(k)\,u(k) + w(k), \tag{20}$$
where $\bar{A}(k)$ is the mean local Jacobian of $f_i^{(k)}$ at $x_i^{+}(k)$ and the disturbance term $w(k)$ collects heterogeneity. See Appendix B.3 for detailed proof and explanations of the terms. Proof. The evolu...
- [31] Therefore, the system is ISS. However, there exists a steady-state error due to the disturbance $w(k)$. In the best case, when $\bar{A}(k)$ converges to $\bar{A}$ and $w(k)$ converges to $w$, the error $\bar{e}(k)$ eventually converges to a steady state given by $\bar{e}_{ss} = \big( I - \bar{A}(1 - pI) \big)^{-1} w$. Therefore, $\bar{e}_{ss} \neq 0$ if $w \neq 0$. $\square$
Remark 1 (Convergence rate versus $K_P$.) From Ineq. 36, smaller $q$ yields...
- [32] This assumption is expected to entail no loss of generality relative to the $|a(t)| \le q < 1$ assumption.
[Figure 6 plot omitted: panels (a) PI and (b) PID; labels $\langle e(0), s(k) \rangle$ and $\langle e(0), e(k) \rangle$.]
Figure 6: Scalar errors across time step of randomly initialized...
- [33] $= M_i(k)\,\tilde{\zeta}_{PI}(k)$ with
$$M_i(k) = \begin{pmatrix} M_p(k) & -G(k) \\ I & I \end{pmatrix}$$
being asymptotically stable. Suppose there is $Q(k) = Q(k)^{\top} \succeq 0$ bounded so that the pair $\big( M_i(k), \sqrt{Q(k)} \big)$ is observable for all $k$, hence the difference Lyapunov equation
$$M_i^{\top}(k)\, P(k+1)\, M_i(k) - P(k) = -Q(k)$$
admits a unique positive definite solution $P(k) = P^{\top}(k) \succ 0$ for all $k$, and a uniform bound $\|P\|_{\infty} := \sup_k \|P(k)\| < \infty$ (see ...