Activation Steering with a Feedback Controller

Dung V. Nguyen; Hieu M. Vu; Lei Zhang; Nhi Y. Pham; Tan M. Nguyen

arxiv: 2510.04309 · v3 · pith:TXRRNTSFnew · submitted 2025-10-05 · 💻 cs.LG


Pith reviewed 2026-05-21 21:44 UTC · model grok-4.3


keywords: activation steering, PID control, large language models, feedback control, behavioral control, safety alignment, control theory


The pith

Activation steering in LLMs corresponds to proportional control, and extending it to full PID yields interpretable error dynamics with stability guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing activation steering methods act as proportional controllers, with the steering vector providing the feedback signal to shift activations toward desired directions. It then introduces PID Steering, which adds an integral term to accumulate corrections across layers and a derivative term to reduce overshoot from sudden activation shifts. This closed-loop approach produces explicit error dynamics that link directly to classical control theory and its stability results. Experiments on multiple model families indicate that the added terms produce more reliable behavioral control than standard steering alone.

Core claim

Popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. PID Steering applies the full PID controller to activation steering in LLMs, yielding interpretable error dynamics and connecting to classical stability guarantees.

What carries the argument

The PID controller for activation steering, where the proportional term aligns activations with target semantic directions, the integral term accumulates errors to enforce persistent corrections across layers, and the derivative term mitigates overshoot by counteracting rapid activation changes.

If this is right

Steering methods acquire theoretical performance guarantees drawn from control theory.
Error dynamics become explicit, making it possible to diagnose issues such as persistent offset or overshoot.
The modular design allows PID terms to combine directly with existing steering vectors and methods.
Behavioral control gains robustness across different layers and model families.
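
To make the "persistent offset" diagnosis concrete, the sketch below iterates a scalar, time-invariant error recursion under P and PI feedback. The recursion e(k+1) = a*e(k) - a*u(k) + w, the choice of feeding the error itself back as the signal, and the values of `a`, `w`, `kp`, and `ki` are all illustrative assumptions, not settings taken from the paper.

```python
def simulate(a, w, kp, ki, steps=200, e0=1.0):
    """Iterate e(k+1) = a*e(k) - a*u(k) + w (assumed toy model).

    The control is u(k) = kp*e(k) + ki*sum_{j<k} e(j), i.e. a P or PI law.
    Returns the error after `steps` iterations.
    """
    e, s = e0, 0.0  # s holds the running sum of past errors e(0..k-1)
    for _ in range(steps):
        u = kp * e + ki * s
        s += e                  # include e(k) in the sum for the next step
        e = a * e - a * u + w   # constant disturbance w enters every step
    return e


# Hypothetical values chosen only so that both loops are stable.
e_p = simulate(a=0.9, w=0.1, kp=0.5, ki=0.0)   # proportional only
e_pi = simulate(a=0.9, w=0.1, kp=0.5, ki=0.2)  # proportional + integral
print(f"P-only residual error: {e_p:.6f}")
print(f"PI residual error:     {e_pi:.2e}")
```

Under these assumptions the P-only loop settles at w / (1 - a*(1 - kp)) rather than zero, while the integral term drives the residual error to zero.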

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mapping could let researchers test other classical controllers, such as lead-lag or state-feedback designs, on the same activation signals.
Checking whether measured activation responses match the closed-loop poles predicted by the PID model would provide a concrete test of the linear approximation.
Because the controller is lightweight, it could be applied selectively to specific attention heads or layers to isolate their contribution to a target behavior.
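
One way such a pole check might look, in the simplest scalar setting: fit an empirical pole to a measured error trajectory and compare it with the pole a*(1 - kp) that a linear P-controlled loop would predict. The plant functions, the gain, and the `tanh` nonlinearity standing in for a transformer layer are hypothetical stand-ins chosen for illustration.

```python
import math


def simulate_p_loop(plant, kp, e0=2.0, steps=30):
    """Iterate e(k+1) = plant(e(k) - kp*e(k)) and return the trajectory."""
    traj = [e0]
    for _ in range(steps):
        e = traj[-1]
        traj.append(plant(e - kp * e))
    return traj


def fit_pole(traj):
    """Least-squares slope of e(k+1) against e(k): the empirical scalar pole."""
    num = sum(x * y for x, y in zip(traj[:-1], traj[1:]))
    den = sum(x * x for x in traj[:-1])
    return num / den


a, kp = 0.9, 0.5                 # assumed plant gain and proportional gain
predicted = a * (1.0 - kp)       # closed-loop pole of the linear model

pole_linear = fit_pole(simulate_p_loop(lambda z: a * z, kp))
pole_tanh = fit_pole(simulate_p_loop(lambda z: math.tanh(a * z), kp))

print(f"predicted pole:        {predicted:.4f}")
print(f"fitted, linear plant:  {pole_linear:.4f}")
print(f"fitted, tanh plant:    {pole_tanh:.4f}")
```

For the linear plant the fitted pole reproduces the prediction; for the saturating plant it comes out smaller, which is the kind of gap such a test would be looking for.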

Load-bearing premise

The nonlinear, discrete, and high-dimensional activation dynamics inside transformer layers can be usefully approximated by the linear time-invariant plant model assumed in classical PID control.

What would settle it

Directly measuring activation trajectories during PID steering and finding that they deviate substantially from the error accumulation and damping predicted by the linear model would undermine the transfer of stability guarantees.

Figures

Figures reproduced from arXiv: 2510.04309 by Dung V. Nguyen, Hieu M. Vu, Lei Zhang, Nhi Y. Pham, Tan M. Nguyen.

**Figure 1.** Our paper connects LLM Behavior Control, Feature Attribution for LLM and Control Theory. Specifically, we apply a PID-Controller to compute the steering vector for activation steering

**Figure 3.** Scalar errors across time step of randomly initialized model after applying P, PI, and PID controller.

**Figure 4.** Qualitative results of activation steering in FLUX-Schnell across two style concepts with the prompt

**Figure 5.** 0-shot and CLIPScore results for 'cyberpunk' and 'steampunk' concept.

**Figure 6.** Scalar errors across time step of randomly initialized model after applying PI and PID controller. Colors

**Figure 7.** Concept cyberpunk.

C.2 Jailbreaking Large Language Models. Tab. 3 reports a comprehensive comparison of attack success rate (ASR) and general benchmark performance across multiple instruction-tuned models under different defense methods. Overall, PID consistently achieves the highest ASR among defenses, while maintaining comparable performance on downstream benchmarks.

**Figure 8.** Concept steampunk.

Original abstract

Controlling the behaviors of large language models (LLM) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control. The code is publicly available at: https://github.com/dungnvnus/pid-steering

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper recasts activation steering as PID control with some empirical gains, but the stability claims rest on an unchecked LTI approximation for transformer layers.


The main thing to know is that this paper proposes PID Steering as an extension of existing activation steering techniques by framing them through proportional-integral-derivative control. Popular methods are cast as P controllers, and the new approach adds I and D terms for better correction over layers.

What stands out is the clean mapping and the practical integration with current methods. The experiments across different LLM families show consistent improvements in behavioral control, which is the kind of evidence that matters for this area. Releasing the code also helps others build on it.

Where it gets thin is the theoretical part. The claim that this yields interpretable error dynamics and connects to stability guarantees assumes the activations behave like a linear time-invariant system. But with nonlinear activations and discrete layer processing in transformers, that approximation isn't justified with any error analysis or validation experiments. So the stability connections don't really carry over without additional work.

This paper is for folks in LLM alignment who are looking for more structured ways to steer model behavior. Someone already familiar with control theory might see the value in the analogy, while others can still use the PID method empirically.

It has enough substance to go to peer review. The empirical results are there to discuss, and the framing is novel enough that referees could help sharpen the theory. I recommend sending it for review.

Referee Report

2 major / 2 minor

Summary. The paper develops a control-theoretic foundation for activation steering in LLMs by showing that popular methods correspond to proportional (P) controllers with the steering vector as feedback signal. It proposes PID Steering, which adds integral and derivative terms to enforce persistent corrections and mitigate overshoot, yielding interpretable error dynamics and connections to classical stability guarantees. The approach is lightweight and integrates with existing methods; experiments across LLM families and benchmarks claim consistent outperformance.

Significance. If the LTI plant approximation for transformer activations is valid to within bounded residuals, the work provides a principled bridge between activation steering and control theory, enabling design of steering methods with potential stability margins and error-trajectory interpretability. Public code release and modular design are strengths that support reproducibility. The empirical gains, if robust, would be practically useful for reliable behavioral control, though the significance depends on validating the modeling assumptions against nonlinear discrete dynamics.

major comments (2)

[Abstract and modeling sections (around the PID controller derivation)] The central claim that PID Steering connects to classical stability guarantees requires the layer activations to be modeled as the output of a linear time-invariant plant whose state evolves according to standard PID error dynamics. No section derives approximation error bounds or empirically validates this for the nonlinear (ReLU/GELU, attention softmax), discrete (layer index and token space), and high-dimensional transformer forward passes; without such validation the transfer of stability margins does not follow.
[Introduction and § on P-controller equivalence] The correspondence between existing steering methods and P controllers is presented as definitional once the steering vector is treated as feedback, but the manuscript does not show that this mapping preserves the closed-loop properties under the true nonlinear dynamics; this makes the extension to I and D terms rest primarily on empirical demonstration rather than the same equations.

minor comments (2)

[PID Steering framework] Clarify the exact definition of the error signal and how the integral term is accumulated across layers without wind-up issues in the discrete setting.
[Experiments] Add details on baseline implementations and statistical significance testing for the reported outperformance to strengthen the experimental claims.
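
For readers unfamiliar with wind-up, the sketch below shows one common remedy from control practice: clamping the norm of the accumulated error. Nothing on this page states that the paper uses this scheme; `clamp_norm`, `accumulate_with_clamp`, and `max_norm` are hypothetical names introduced for illustration.

```python
import math
from typing import List, Sequence


def clamp_norm(v: Sequence[float], max_norm: float) -> List[float]:
    """Rescale v so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= max_norm or norm == 0.0:
        return list(v)
    return [x * max_norm / norm for x in v]


def accumulate_with_clamp(errors: Sequence[Sequence[float]],
                          max_norm: float) -> List[List[float]]:
    """Running sum of per-layer error vectors, clamped after every layer."""
    integral = [0.0] * len(errors[0])
    history = []
    for e in errors:
        summed = [s + x for s, x in zip(integral, e)]
        integral = clamp_norm(summed, max_norm)  # anti-windup step
        history.append(integral)
    return history


# Five layers with the same error vector: an unclamped sum would reach norm 5.
history = accumulate_with_clamp([[1.0, 0.0]] * 5, max_norm=2.5)
for layer, integral in enumerate(history):
    print(layer, integral)
```

The bound `max_norm` is an extra tuning knob; choosing it is a design decision that this sketch does not address.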

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions that will clarify the scope of our modeling assumptions while preserving the core contributions of the work.


Referee: [Abstract and modeling sections (around the PID controller derivation)] The central claim that PID Steering connects to classical stability guarantees requires the layer activations to be modeled as the output of a linear time-invariant plant whose state evolves according to standard PID error dynamics. No section derives approximation error bounds or empirically validates this for the nonlinear (ReLU/GELU, attention softmax), discrete (layer index and token space), and high-dimensional transformer forward passes; without such validation the transfer of stability margins does not follow.

Authors: We agree that a strict transfer of classical stability margins would require explicit approximation error bounds between the LTI model and the true nonlinear, discrete transformer dynamics. Deriving such bounds analytically is beyond the current scope. In the revision we will add a dedicated subsection that states the LTI approximation explicitly, discusses its limitations with respect to ReLU/GELU nonlinearities and attention softmax, and supplies additional empirical plots of observed error trajectories under PID steering. This will reframe the stability connection as a principled design heuristic rather than a direct guarantee.
revision: partial
Referee: [Introduction and § on P-controller equivalence] The correspondence between existing steering methods and P controllers is presented as definitional once the steering vector is treated as feedback, but the manuscript does not show that this mapping preserves the closed-loop properties under the true nonlinear dynamics; this makes the extension to I and D terms rest primarily on empirical demonstration rather than the same equations.

Authors: The P-controller equivalence is introduced by interpreting the steering vector as a feedback signal in activation space. We do not assert that all closed-loop properties are preserved under the actual nonlinear dynamics. The integral and derivative terms are added to mitigate empirically observed shortcomings of proportional-only steering (persistent offset and overshoot). We will revise the introduction and the relevant section to distinguish the definitional mapping from the heuristic PID extension and to emphasize that the primary evidence for the full framework remains the experimental results across model families.
revision: partial

Circularity Check

1 step flagged

Steering-to-P correspondence is definitional once steering vector is identified as feedback; PID adds empirical terms

specific steps

self-definitional [Abstract]
"we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal."

Once the steering vector is stipulated to be the feedback signal, any method that adds a vector proportional to that signal is a P-controller by definition. The claimed 'showing' therefore reduces to the identification itself rather than a derived equivalence from the transformer equations.

full rationale

The paper's claimed control-theoretic foundation reduces to a re-labeling: existing steering vectors are declared to be the feedback signal, after which the methods are P-controllers by the definition of proportional control. This step is load-bearing for the 'foundation' but contains no independent derivation. The subsequent PID extension introduces I and D terms whose benefit is shown via experiments on real LLMs rather than forced by the same equations, supplying moderate independent content. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the load-bearing chain. The LTI plant modeling is an unvalidated ansatz but does not make the reported results tautological by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating LLM layer activations as a controllable plant whose dynamics admit a linear feedback approximation; no free parameters are introduced in the abstract, no new physical entities are postulated, and the main axiom is the validity of the control analogy itself.

axioms (1)

domain assumption: LLM activation dynamics can be approximated sufficiently well by a linear feedback control model for the purpose of applying PID corrections and invoking classical stability results.
Invoked when the paper states that PID Steering connects activation steering to classical stability guarantees.
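
Written out, the assumption can be read as requiring the mean steering error to follow a linear recursion. The first form below is the one quoted from Proposition 2 in the reference-graph excerpt on this page; the time-invariant special case with constant $\bar A$ and $w$ on the second line is a simplification added here for illustration, not a statement from the paper.

```latex
\begin{aligned}
\bar e(k+1) &= \bar A(k)\,\bar e(k) - \bar A(k)\,u(k) + w(k)
  && \text{(layer-dependent form, Proposition 2)} \\
\bar e(k+1) &= \bar A\,\bar e(k) - \bar A\,u(k) + w
  && \text{(assumed LTI special case: } \bar A(k) \equiv \bar A,\ w(k) \equiv w\text{)}
\end{aligned}
```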



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

We develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers... propose Proportional-Integral-Derivative (PID) Steering... u(k)=K_p r(k)+K_i ∑_{j=0}^{k-1} r(j)+K_d (r(k)−r(k−1))
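
The quoted control law can be written as a short routine. This is a minimal sketch, not the authors' released code: the function name, the list-of-lists representation of the per-layer signals r(k), and the convention r(-1) = 0 for the first layer are assumptions.

```python
from typing import List, Sequence


def pid_steering_vectors(r: Sequence[Sequence[float]],
                         kp: float, ki: float, kd: float) -> List[List[float]]:
    """Compute u(k) for every layer k from the feedback signals r(k).

    u(k) = kp*r(k) + ki*sum_{j=0}^{k-1} r(j) + kd*(r(k) - r(k-1)),
    with r(-1) taken as the zero vector (an assumption of this sketch).
    """
    if not r:
        return []
    dim = len(r[0])
    integral = [0.0] * dim  # running sum of r(0..k-1)
    prev = [0.0] * dim      # r(k-1), assumed zero before the first layer
    out = []
    for rk in r:
        u = [kp * x + ki * s + kd * (x - p)
             for x, s, p in zip(rk, integral, prev)]
        out.append(u)
        integral = [s + x for s, x in zip(integral, rk)]
        prev = list(rk)
    return out


# Toy 2-dimensional signals for three layers (made-up numbers).
signals = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
for k, u in enumerate(pid_steering_vectors(signals, kp=1.0, ki=0.5, kd=0.1)):
    print(k, u)
```

With ki = kd = 0 the routine returns kp*r(k), the proportional-only case that the paper identifies with existing steering methods.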

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
cs.LG 2026-04 conditional novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

When control meets large language models: From words to dynamics
eess.SY 2026-02 unverdicted novelty 3.0

The paper proposes a bidirectional continuum between LLMs and control systems, covering LLM-assisted controller design, control-based LLM steering, and state-space modeling of LLMs.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 2 Pith papers · 9 internal anchors

[1]

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C

URL https://transformer-circuits.pub/2023/monosemantic-features/index.html. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server,

work page 2023
[2]

Microsoft COCO Captions: Data Collection and Evaluation Server

URL https://arxiv.org/abs/1504.00325. Emily Cheng and Carmen Amo Alonso. Linearly controlled language generation with performative guarantees. arXiv preprint arXiv:2405.15454,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue

URL https://arxiv.org/abs/2405.15454. Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. On effects of steering latent representation for large language model unlearning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 23733–23742,

work page arXiv
[4]

URL https://transformer-circuits.pub/2022/toy_model/index.html. L. Euler. Institutionum calculi integralis. Number v. 1 in Institutionum calculi integralis. imp. Acad. imp. Saènt.,

work page 2022
[5]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369, Online, November

work page 2020
[6]

doi: 10.18653/v1/2020.findings-emnlp.301

Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301/. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, pp. 1...

work page doi:10.18653/v1/2020.findings-emnlp.301 2020
[7]

Gemma 2: Improving Open Language Models at a Practical Size

Google Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528, Online and Punta Cana, Domin...

work page 2021
[9]

doi: 10.18653/v1/2021.emnlp-main.595

Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.595. URL https://aclanthology.org/2021.emnlp-main.595/. Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press,

work page doi:10.18653/v1/2021.emnlp-main.595 2021
[10]

Mistral 7B

URL https://arxiv.org/abs/2310.06825. Zhong-Ping Jiang, Eduardo Sontag, and Yuan Wang. Input-to-state stability for discrete-time nonlinear systems. IFAC Proceedings Volumes, 32(2):2403–2408,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

doi: https://doi.org/10.1016/S1474-6670(17)56408-3

ISSN 1474-6670. doi: https://doi.org/10.1016/S1474-6670(17)56408-3. URL https://www.sciencedirect.com/science/article/pii/S1474667017564083. 14th IFAC World Congress 1999, Beijing, China, 5-9 July. Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, and Tobias Hecking. Style Vectors for Steering Genera...

work page doi:10.1016/s1474-6670(17)56408-3 1999
[13]

Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan

URL https://arxiv.org/abs/2406.05954. Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105,

work page arXiv
[14]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. arXiv preprint arXiv:2409.05907,

work page arXiv
[15]

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

URL https://arxiv.org/abs/2402.13929. AI @ Meta Llama Team. The llama 3 herd of models,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

The Llama 3 Herd of Models

URL https://arxiv.org/abs/2407.21783. Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6804–6818,...

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang

URL https://arxiv.org/abs/2310.14201. Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing,

work page arXiv
[19]

tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992,

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992,

work page arXiv
[20]

Steering Llama 2 via Contrastive Activation Addition

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.828. URL https://aclanthology.org/2024.acl-long.828/. Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations. In The Thirteenth International Conference on Learn...

work page doi:10.18653/v1/2024.acl-long.828 2024
[21]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

URL https://openreview.net/forum?id=l2zFn6TIQi. Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodr´ıguez

URL https://arxiv.org/abs/2305.18449. Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodríguez. Whispering experts: neural interventions for toxicity mitigation in language models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org,

work page arXiv
[25]

Steering Language Models With Activation Engineering

URL https://arxiv.org/abs/2308.10248. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering Language Models With Activation Engineering, October

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Activation Steering with a Feedback Controller

Curran Associates Inc. Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation Engineering: A...

work page 2024
[29]

$D(t) = K_d \frac{dr(t)}{dt}$. Approximating the derivative by the backward Euler difference with $h = 1$ gives $D(k) = K_d\,(r(k) - r(k-1))$ (23). Combining equation 21, equation 22, and equation 23 yields $u(k) = K_p\, r(k) + K_i \sum_{j=0}^{k-1} r(j) + K_d\,(r(k) - r(k-1))$. □ B.2 Background on Input-to-State Stability & Notations. Background on Input-to-state Stability (ISS). In our proofs,...

work page 1999
[30]

See Appendix B.3 for detailed proof and explanations of the terms

Proposition 2 (Error dynamics of activation steering). The error dynamics $\bar e(k)$ in activation steering is of the form $\bar e(k+1) = \bar A(k)\,\bar e(k) - \bar A(k)\,u(k) + w(k)$ (20), where $\bar A(k)$ is the mean local Jacobian of $f_i^{(k)}$ at $x_i^{+}(k)$ and the disturbance term $w(k)$ collects heterogeneity. See Appendix B.3 for detailed proof and explanations of the terms. Proof. The evolu...

work page 1999
[31]

discharge

Therefore, the system is ISS. However, there exists a steady-state error due to the disturbance $w(k)$. In the best case, when $\bar A(k)$ converges to $\bar A$ and $w(k)$ converges to $w$, the error $\bar e(k)$ eventually converges to a steady state given by $\bar e_{ss} = (I - \bar A(1 - pI))^{-1} w$. Therefore, $\bar e_{ss} \neq 0$ if $w \neq 0$. □ Remark 1 (Convergence rate versus $K_P$). From Ineq. 36, smaller $q$ yields...

work page 2012
[32]

21 Preprint

This assumption is expected to entail no loss of generality relative to the $|a(t)| \le q < 1$ assumption. [Plot with panels (a) PI and (b) PID; labels: <e(0), s(k)> and <e(0), e(k)>] Figure 6: Scalar errors across time step of randomly initialized...

work page 1999
[33]

$= M_i(k)\,\tilde\zeta_{PI}(k)$ with $M_i(k) = \begin{pmatrix} M_p(k) & -G(k) \\ I & I \end{pmatrix}$, being asymptotically stable. Suppose there is $Q(k) = Q(k)^\top \succeq 0$ bounded so that the pair $(M_i(k), \sqrt{Q(k)})$ is observable for all $k$, hence the difference Lyapunov equation $M_i^\top(k)\,P(k+1)\,M_i(k) - P(k) = -Q(k)$ admits a unique positive definite solution $P(k) = P^\top(k) \succ 0$ for all $k$, and a uniform bound $\|P\|_\infty := \sup_k \|P(k)\| < \infty$ (see ...

work page 2008

[1] [1]

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C

URL https://transformer-circuits.pub/2023/monosemantic-features/index.html. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server,

work page 2023

[2] [2]

Microsoft COCO Captions: Data Collection and Evaluation Server

URL https://arxiv.org/abs/1504.00325. Emily Cheng and Carmen Amo Alonso. Linearly controlled language generation with performative guarantees.arXiv preprint arXiv:2405.15454,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue

URL https://arxiv.org/abs/2405.15454. Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. On effects of steering latent representation for large language model unlearning. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 23733–23742,

work page arXiv

[4] [4]

URL ”https://transformer-circuits.pub/2022/toy model/index.html”. L. Euler.Institutionum calculi integralis. Number v. 1 in Institutionum calculi integralis. imp. Acad. imp. Sa`ent.,

work page 2022

[5] [5]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxic- ityPrompts: Evaluating neural toxic degeneration in language models. In Trevor Cohn, Yulan He, and Yang Liu (eds.),Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369, Online, November

work page 2020

[6] [6]

doi: 10.18653/v1/2020.findings-emnlp.301

Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301/. Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. InCausal Learning and Reasoning, pp. 1...

work page doi:10.18653/v1/2020.findings-emnlp.301 2020

[7] [7]

Gemma 2: Improving Open Language Models at a Practical Size

Google Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L´eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram´e, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.),Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528, Online and Punta Cana, Domin...

work page 2021

[9] [9]

doi: 10.18653/v1/2021.emnlp-main.595

Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.595. URL https://aclanthology.org/2021.emnlp-main.595/. Roger A Horn and Charles R Johnson.Matrix analysis. Cambridge university press,


[10] [10]

Mistral 7B

URL https://arxiv.org/abs/2310.06825. Zhong-Ping Jiang, Eduardo Sontag, and Yuan Wang. Input-to-state stability for discrete-time nonlinear systems. IFAC Proceedings Volumes, 32(2):2403–2408,


[11] [11]


ISSN 1474-6670. doi: https://doi.org/10.1016/S1474-6670(17)56408-3. URL https://www.sciencedirect.com/science/article/pii/S1474667017564083. 14th IFAC World Congress 1999, Beijing, China, 5-9 July. Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, and Tobias Hecking. Style Vectors for Steering Genera...


[12] [13]

Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan

URL https://arxiv.org/abs/2406.05954. Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105,


[13] [14]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. arXiv preprint arXiv:2409.05907,


[14] [15]

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

URL https://arxiv.org/abs/2402.13929. AI @ Meta Llama Team. The llama 3 herd of models,


[15] [16]

The Llama 3 Herd of Models

URL https://arxiv.org/abs/2407.21783. Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6804–6818,...


[16] [18]

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang

URL https://arxiv.org/abs/2310.14201. Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing,


[17] [19]

tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992,


[18] [20]

Steering Llama 2 via Contrastive Activation Addition

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.828. URL https://aclanthology.org/2024.acl-long.828/. Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations. In The Thirteenth International Conference on Learn...


[19] [21]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

URL https://openreview.net/forum?id=l2zFn6TIQi. Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324,


[20] [23]

Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodríguez

URL https://arxiv.org/abs/2305.18449. Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau Rodríguez. Whispering experts: neural interventions for toxicity mitigation in language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org,


[21] [25]

Steering Language Models With Activation Engineering

URL https://arxiv.org/abs/2308.10248. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering Language Models With Activation Engineering, October


[22] [26]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652,


[23] [27]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,


[24] [28]

Activation Steering with a Feedback Controller

Curran Associates Inc. Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation Engineering: A...


[25] [29]

$D(t) = K_d \frac{dr(t)}{dt}$. Approximating the derivative by the backward Euler difference with $h = 1$ gives

$D(k) = K_d \big(r(k) - r(k-1)\big)$. (23)

Combining equation 21, equation 22, and equation 23 yields

$u(k) = K_p r(k) + K_i \sum_{j=0}^{k-1} r(j) + K_d \big(r(k) - r(k-1)\big)$. □

B.2 BACKGROUND ON INPUT-TO-STATE STABILITY & NOTATIONS

Background on Input-to-state Stability (ISS). In our proofs,...
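The discrete PID law obtained at the end of the proof above, u(k) = K_p r(k) + K_i Σ_{j=0}^{k-1} r(j) + K_d (r(k) − r(k−1)), can be sketched for a scalar error sequence as follows. This is only an illustrative sketch: the function name `pid_control`, the scalar simplification (the paper works with activation vectors), the sample values, and the choice of zero derivative action at k = 0 are assumptions, not taken from the paper.

```python
from typing import Sequence


def pid_control(r: Sequence[float], k: int, kp: float, ki: float, kd: float) -> float:
    """Return u(k) = kp*r(k) + ki*sum_{j=0}^{k-1} r(j) + kd*(r(k) - r(k-1))."""
    proportional = kp * r[k]
    integral = ki * sum(r[:k])  # sum over j = 0, ..., k-1 (empty at k = 0)
    # Assumption: no derivative action at k = 0, since r(-1) is undefined.
    derivative = kd * (r[k] - r[k - 1]) if k > 0 else 0.0
    return proportional + integral + derivative


# Hypothetical scalar error sequence r(0), r(1), r(2).
errors = [1.0, 0.5, 0.25]
u2 = pid_control(errors, k=2, kp=2.0, ki=1.0, kd=4.0)
```

Setting `ki = kd = 0` recovers the purely proportional case that the paper identifies with standard activation steering.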


[26] [30]

See Appendix B.3 for detailed proof and explanations of the terms

Proposition 2 (Error dynamics of activation steering) The error dynamics $\bar{e}(k)$ in activation steering is of the form: $\bar{e}(k+1) = \bar{A}(k)\bar{e}(k) - \bar{A}(k)u(k) + w(k)$, (20) where $\bar{A}(k)$ is the mean local Jacobian of $f_i^{(k)}$ at $x_i^{+}(k)$ and the disturbance term $w(k)$ collects heterogeneity. See Appendix B.3 for detailed proof and explanations of the terms. Proof. The evolu...


[27] [31]


Therefore, the system is ISS. However, there exists a steady-state error due to the disturbance $w(k)$. In the best case, when $\bar{A}(k)$ converges to $\bar{A}$ and $w(k)$ converges to $w$, the error $\bar{e}(k)$ eventually converges to a steady state given by $\bar{e}_{ss} = (I - \bar{A}(1 - pI))^{-1} w$. Therefore, $\bar{e}_{ss} \neq 0$ if $w \neq 0$. □ Remark 1 (Convergence rate versus $K_P$.) From Ineq. 36, smaller $q$ yields...
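The steady-state claim above can be checked numerically in the scalar, time-invariant case. The sketch below assumes a constant Jacobian `a`, a constant disturbance `w`, and proportional feedback u(k) = p·e(k), so that the error recursion of Proposition 2 becomes e(k+1) = a·e(k) − a·p·e(k) + w and the steady state reduces to w / (1 − a(1 − p)). The function names and numerical values are hypothetical.

```python
def simulate_p_control(a: float, p: float, w: float, e0: float, steps: int) -> float:
    """Iterate e(k+1) = a*e(k) - a*u(k) + w with u(k) = p*e(k) (scalar sketch)."""
    e = e0
    for _ in range(steps):
        u = p * e              # proportional feedback (assumed form of u)
        e = a * e - a * u + w  # scalar version of the error dynamics
    return e


def steady_state_error(a: float, p: float, w: float) -> float:
    """Scalar steady state w / (1 - a*(1 - p)); requires |a*(1 - p)| < 1."""
    return w / (1.0 - a * (1.0 - p))


# Hypothetical values: contraction factor a*(1 - p) = 0.45.
e_sim = simulate_p_control(a=0.9, p=0.5, w=0.1, e0=1.0, steps=200)
e_ss = steady_state_error(a=0.9, p=0.5, w=0.1)
```

With a nonzero `w` the simulated error settles at `e_ss` rather than at zero, which is the residual offset that proportional control alone leaves behind.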


[28] [32]

This assumption is expected to entail no loss of generality relative to the $|a(t)| \le q < 1$ assumption.

[Figure 6 plots omitted: panels (a) PI and (b) PID, with axes ⟨e(0), s(k)⟩ and ⟨e(0), e(k)⟩.]

Figure 6: Scalar errors across time step of randomly initialized...


[29] [33]

$= M_i(k)\tilde{\zeta}_{PI}(k)$ with $M_i(k) = \begin{bmatrix} M_p(k) & -G(k) \\ I & I \end{bmatrix}$, being asymptotically stable. Suppose there is $Q(k) = Q(k)^{\top} \succeq 0$ bounded so that the pair $(M_i(k), \sqrt{Q(k)})$ is observable for all $k$, hence the difference Lyapunov equation $M_i^{\top}(k) P(k+1) M_i(k) - P(k) = -Q(k)$ admits a unique positive definite solution $P(k) = P^{\top}(k) \succ 0$ for all $k$, and a uniform bound $\|P\|_{\infty} := \sup_k \|P(k)\| < \infty$ (see ...
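To make the Lyapunov condition above concrete, the following sketch treats a time-invariant 2×2 instance in which every block is a scalar, so that M_i = [[m_p, −g], [1, 1]] and the difference Lyapunov equation reduces to M^T P M − P = −Q. The values of `m_p` and `g`, the choice Q = I, and the fixed-point solver are assumptions made for illustration; the paper's setting is time-varying and matrix-valued.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(a: Matrix) -> Matrix:
    return [[a[j][i] for j in range(2)] for i in range(2)]


def is_schur_stable(m: Matrix) -> bool:
    """Jury test for a 2x2 matrix: both eigenvalues lie strictly inside the unit circle."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return abs(det) < 1.0 and abs(trace) < 1.0 + det


def solve_lyapunov(m: Matrix, q: Matrix, iters: int = 500) -> Matrix:
    """Solve M^T P M - P = -Q by iterating P <- Q + M^T P M (converges if M is Schur stable)."""
    p = [row[:] for row in q]
    mt = transpose(m)
    for _ in range(iters):
        mpm = matmul(matmul(mt, p), m)
        p = [[q[i][j] + mpm[i][j] for j in range(2)] for i in range(2)]
    return p


# Hypothetical scalar blocks: m_p plays the role of M_p, g the role of G.
m_p, g = 0.45, 0.09
M = [[m_p, -g], [1.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
P = solve_lyapunov(M, Q)

# Residual of the Lyapunov equation and positive definiteness of P.
MPM = matmul(matmul(transpose(M), P), M)
residual = max(abs(MPM[i][j] - P[i][j] + Q[i][j]) for i in range(2) for j in range(2))
p_is_positive_definite = P[0][0] > 0 and (P[0][0] * P[1][1] - P[0][1] * P[1][0]) > 0
```

For matrix-valued blocks a library routine such as SciPy's `scipy.linalg.solve_discrete_lyapunov` would be the more practical choice; the hand-rolled iteration here only keeps the sketch dependency-free.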
