pith. machine review for the scientific record.

arxiv: 2604.19018 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI · cs.SY · eess.SY · math.OC · stat.ML

Recognition: unknown

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 02:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY · math.OC · stat.ML
keywords activation steering · linear quadratic regulator · local linearity · transformer dynamics · LLM alignment · feedback control · Jacobian approximation · semantic setpoints

The pith

Local linearity of LLM layers allows closed-loop linear quadratic regulator control for activation steering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transformer layer dynamics are sufficiently linear locally to support model-based optimal control. It adapts the linear quadratic regulator to steer activations toward semantic targets using real-time Jacobian feedback. This closed-loop approach outperforms open-loop steering methods on tasks like toxicity reduction and truthfulness adjustment. The method requires no offline training and comes with theoretical tracking error bounds. A sympathetic reader would care because it offers a lightweight way to align LLM behavior at inference time with formal guarantees.

Core claim

Despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error.
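The claimed machinery, linear time-varying dynamics controlled by a finite-horizon LQR over per-layer Jacobians, can be sketched with a backward Riccati recursion. This is an illustrative reconstruction, not the authors' code: the random "Jacobians", dimensions, and weights below are invented, and the direct input channel (B = I) mirrors additive activation steering.

```python
import numpy as np

def lqr_gains(A_list, Q, R, QT):
    """Backward Riccati recursion for dz_{k+1} = A_k dz_k + u_k (i.e. B = I)."""
    P = QT.copy()
    gains = []
    for A in reversed(A_list):
        K = np.linalg.solve(R + P, P @ A)  # K_k = (R + P_{k+1})^{-1} P_{k+1} A_k
        P = Q + A.T @ P @ A - A.T @ P @ K  # Riccati update
        gains.append(K)
    return gains[::-1]

rng = np.random.default_rng(0)
n, T = 4, 6
A_list = [np.eye(n) + 0.1 * rng.standard_normal((n, n)) for _ in range(T)]  # stand-in Jacobians
Q, R, QT = np.eye(n), 10.0 * np.eye(n), 5.0 * np.eye(n)
gains = lqr_gains(A_list, Q, R, QT)

dz = rng.standard_normal(n)          # initial deviation from the setpoint
closed_cost, z = 0.0, dz.copy()
for A, K in zip(A_list, gains):      # closed-loop rollout with u_k = -K_k dz_k
    u = -K @ z
    closed_cost += z @ Q @ z + u @ R @ u
    z = A @ z + u
closed_cost += z @ QT @ z

open_cost, z = 0.0, dz.copy()        # same rollout with u = 0
for A in A_list:
    open_cost += z @ Q @ z
    z = A @ z
open_cost += z @ QT @ z
print(closed_cost <= open_cost)      # → True: LQR is optimal over all input sequences
```

The optimality guarantee is what makes the comparison trivial here: the zero-input sequence is one admissible policy, so the LQR rollout can never incur a larger quadratic cost.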

What carries the argument

The locally-linear approximation of layer-wise dynamics, which enables adapting the linear quadratic regulator to provide Jacobian-based feedback controllers for closed-loop activation steering.

If this is right

  • Steering achieves robust, fine-grained behavior control across models, scales, and tasks.
  • Surpasses baseline methods in modulating toxicity, truthfulness, refusal, and arbitrary concepts.
  • Provides formal performance guarantees via derived bounds on setpoint tracking error.
  • Operates with minimal computational overhead and requires no offline training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If local linearity holds more generally, other classical control techniques such as model predictive control could be adapted for LLM internals.
  • The method suggests activation steering can be reframed as an online control problem rather than a static intervention.
  • Success here raises the possibility of designing or fine-tuning models to preserve or strengthen layer-wise linearity for easier control.

Load-bearing premise

Layer-wise dynamics remain sufficiently linear over the full generation trajectory for the LQR feedback to keep tracking errors low and stable.
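A toy probe of this premise (everything below is invented for illustration; a tanh map stands in for a transformer block): push a small perturbation through a stack of nonlinear layers and compare the true propagated deviation against the Jacobian prediction at every depth. If local linearity held over the whole trajectory, the relative residual would stay small at every layer, not just the first.

```python
import numpy as np

def layer(z, W):
    return np.tanh(W @ z)                        # stand-in for a transformer block

def layer_jacobian(z, W):
    pre = W @ z
    return np.diag(1.0 - np.tanh(pre) ** 2) @ W  # d tanh(Wz)/dz

rng = np.random.default_rng(1)
n, depth, eps = 8, 10, 1e-2
Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]
z = rng.standard_normal(n)
dz = eps * rng.standard_normal(n)

z_pert, dz_lin, residuals = z + dz, dz, []
for W in Ws:
    dz_lin = layer_jacobian(z, W) @ dz_lin       # linearized propagation
    z, z_pert = layer(z, W), layer(z_pert, W)    # true propagation of both trajectories
    true_dev = z_pert - z
    residuals.append(np.linalg.norm(true_dev - dz_lin) / np.linalg.norm(true_dev))
print([f"{r:.3f}" for r in residuals])           # relative residual per layer
```

For a small enough perturbation the first-layer residual is near zero by Taylor's theorem; the interesting question, and the one the premise turns on, is how fast it compounds with depth.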

What would settle it

If the Jacobian-based LQR controller produces a larger activation deviation from the semantic setpoints than open-loop baselines on a held-out model scale or task, then local linearity is not sufficient for reliable control.
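A toy version of that comparison (illustrative only; a one-step error-cancelling feedback rule stands in for the paper's A-LQR, and the layers, feature direction, and gains are all invented): steer a scalar feature β_k = v⊤z_k toward a target through a stack of tanh "layers", comparing a fixed open-loop intervention against feedback on the error measured at each layer.

```python
import numpy as np

def layer(z, W):
    return np.tanh(W @ z)                        # stand-in for a transformer block

rng = np.random.default_rng(2)
n, depth, target = 8, 12, 1.5
Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]
v = rng.standard_normal(n)
v /= np.linalg.norm(v)                           # unit feature direction
z0 = rng.standard_normal(n)

def rollout(z0, feedback):
    z = z0.copy()
    for W in Ws:
        z_next = layer(z, W)
        if feedback:
            u = (target - v @ z_next) * v        # closed loop: cancel the measured error
        else:
            u = 0.1 * v                          # open loop: fixed additive steering
        z = z_next + u
    return abs(target - v @ z)                   # final setpoint deviation

closed_err = rollout(z0, feedback=True)
open_err = rollout(z0, feedback=False)
print(closed_err < open_err)
```

In this toy the feedback rollout tracks the setpoint to machine precision while the fixed vector cannot, which is the qualitative gap the falsification test asks about; the paper's test would run the same comparison with A-LQR on a held-out model or task.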

Figures

Figures reproduced from arXiv: 2604.19018 by Glen Chou, Julian Skifstad, Xinyue Annie Yang.

Figure 1. Overview. At each LLM layer k, the method, A-LQR, computes a steering intervention u_k that minimizes the deviation between the semantic feature value β_k := v_k⊤z_k at the current activation z_k and a desired target β*_k, where v_k encodes the feature direction. To construct u_k, the linear quadratic regulator (LQR) is used to efficiently compute steering controllers using linear approximations of the LLM tr…

Figure 2. A-LQR linearizes each transformer block ϕ_k and uses this local structure to synthesize control actions u_k that steer the activations toward desired setpoints z′_k. Each ϕ_k can be well approximated as locally linear (Sec. 5); thus, for two reachable activations z_2 and z_2^alt, the corresponding Jacobians are similar, i.e., (∂ϕ_2/∂z)|_{z=z_2} ≈ (∂ϕ_2/∂z)|_{z=z_2^alt}…

Figure 3. Empirical tracking error satisfies the bound (22) with 10 tracking rollouts. At each layer, the bound and error values are normalized by the sampled mean P-norm of the layer activations; hence, these values are relative to the ambient activation norm.

Figure 6. Concept prevalence (% of generations with relevant output) across λ values. Prevalence is the % of 500 trials which exhibit the specified concept, as scored by an LLM-as-a-judge.

Figure 7. Jacobian spectral similarity on Qwen-2.5-3B: initial, intermediate, and final linearized-layer alignment. Each plot contains comparisons across 50 matrices (one matrix per row/column), with lighter pixels corresponding to stronger alignment: (a) randomly sampled matrices, (b) randomly sampled nominal Jacobians of differing concepts, and (c) Jacobians corresponding to the prompts related to the concept "Cloud."

Figure 8. Jacobian spectral similarity on Llama-3-8B: initial, intermediate, and final linearized-layer alignment. Each plot contains comparisons across 50 matrices (one matrix per row/column), with lighter pixels corresponding to stronger alignment: (a) randomly sampled matrices, (b) randomly sampled nominal Jacobians of differing concepts, and (c) Jacobians corresponding to the prompts related to the concept "Cloud."

Figure 9. Jacobian spectral similarity on Llama-3-1B: initial, intermediate, and final linearized-layer alignment. Each plot contains comparisons across 50 matrices (one matrix per row/column), with lighter pixels corresponding to stronger alignment: (a) randomly sampled matrices, (b) randomly sampled nominal Jacobians of differing concepts, and (c) Jacobians corresponding to the prompts related to the concept "Cloud."

Figure 10. Empirical tracking bounds on two additional models. The empirical tracking error satisfies the bound (22) with 10 tracking rollouts.

Figure 11. Qwen-2.5-3B spectrum distribution, showing alignment (same layout as …

Figure 12. Llama-8B spectrum distribution, showing alignment (same layout as …

Figure 13. Llama-1B spectrum distribution, showing alignment (same layout as …

Figure 14. Tiles are ordered from left to right and correspond to layers 0, ⌊N/3⌋, and ⌊2N/3⌋, where N is the total number of layers. Each tile shows the alignment between the sampled Jacobians of the specified prompts at the corresponding layer. (a) Gemma-2-2B code-prompt alignment. (b) Gemma-2-2B law-prompt alignment. (c) Llama-3-8B…

Figure 15. Gemma-2-2B alignment with heterogeneous data. The top-left quadrant corresponds to "code" prompts and the bottom-right quadrant corresponds to adversarial prompts; the remaining quadrants show cross-alignment between these two datasets.

Figure 16. Gemma-2-2B alignment with heterogeneous data. The top-left quadrant corresponds to "code" prompts and the bottom-right quadrant corresponds to general prompts written in Japanese; the remaining quadrants show cross-alignment between these two datasets.

Figure 17. Qwen-2.5-3B alignment with heterogeneous data. The top-left quadrant corresponds to "code" prompts and the bottom-right quadrant corresponds to general prompts written in Japanese; the remaining quadrants show cross-alignment between these two datasets.

Figure 18. Llama-3-8B alignment with heterogeneous data. The top-left quadrant corresponds to "code" prompts and the bottom-right quadrant corresponds to general prompts written in Japanese; the remaining quadrants show cross-alignment between these two datasets.

Figure 19. Gemma-2-2B empirical tracking bounds under heterogeneous and adversarial inputs.

Figure 20. Llama-3.2-1B empirical tracking bounds under heterogeneous and adversarial inputs.

Figure 21. Qwen-2.5-3B empirical tracking bounds under heterogeneous and adversarial inputs.

Figure 22. LQR and λ parameter sweeps in the task of toxicity mitigation (q_f := q_T, where Q_T = q_T I). The sampling parameters here differ from those of the main experiments: temperature 0.7, with no repetition penalty or top-p setting.
original abstract

Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that despite the nonlinear structure of transformer blocks, layer-wise dynamics in LLMs are well-approximated by locally linear models across architectures and scales; it models inference as a linear time-varying system, adapts classical LQR to compute Jacobian-based feedback controllers for closed-loop steering of activations to adaptive semantic setpoints, derives theoretical bounds on tracking error, and reports superior empirical performance on toxicity, truthfulness, refusal, and concept modulation tasks with no offline training.

Significance. If the local-linearity assumption holds over full trajectories, the work supplies a control-theoretic framework for activation steering that combines empirical validation, formal error bounds, and reproducible code; this could advance inference-time alignment by replacing heuristic open-loop interventions with stable, low-overhead feedback control and quantifiable guarantees.

major comments (2)
  1. [§3] §3 (empirical linearity checks): the reported approximation quality and Jacobian-based residuals are shown at selected points or short horizons, but the manuscript does not quantify how these residuals evolve or remain bounded over complete generation trajectories; this directly affects whether the LQR closed-loop eigenvalues stay stable and whether the derived tracking-error bounds remain tight, as required by the central claim.
  2. [§4.2] §4.2 (LQR formulation and error bounds): the tracking-error bounds are stated to scale with the linearization residual, yet the experiments do not report a direct comparison between predicted bound values and observed setpoint errors across models or tasks; without this calibration the formal guarantees' practical relevance is difficult to assess.
minor comments (2)
  1. [§4] The choice and sensitivity analysis of the LQR weighting matrices Q and R (free parameters) are not detailed; a brief ablation or default-setting justification would clarify reproducibility.
  2. [§5] Figure captions and the adaptive semantic-feature setpoint definition would benefit from an explicit equation or pseudocode block to make the online adaptation rule unambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important opportunities to strengthen the empirical support for our local-linearity claims and the practical calibration of our theoretical bounds. We address each major comment below and commit to revisions that directly respond to the concerns raised.

point-by-point responses
  1. Referee: [§3] §3 (empirical linearity checks): the reported approximation quality and Jacobian-based residuals are shown at selected points or short horizons, but the manuscript does not quantify how these residuals evolve or remain bounded over complete generation trajectories; this directly affects whether the LQR closed-loop eigenvalues stay stable and whether the derived tracking-error bounds remain tight, as required by the central claim.

    Authors: We agree that explicit quantification of residual evolution over full trajectories is needed to rigorously support stability of the closed-loop eigenvalues and tightness of the tracking bounds. Our current experiments evaluate residuals at representative points and short segments during generation, which is consistent with the local-linearity hypothesis but does not fully address long-horizon behavior. In the revised manuscript we will add new figures and tables showing the evolution of Jacobian-based residuals (||f(x) - Ax - b||) over complete generation sequences for the toxicity, truthfulness, refusal, and concept-modulation tasks across the evaluated models. We will also report whether residuals remain bounded and discuss any implications for eigenvalue stability within the LQR design. revision: yes

  2. Referee: [§4.2] §4.2 (LQR formulation and error bounds): the tracking-error bounds are stated to scale with the linearization residual, yet the experiments do not report a direct comparison between predicted bound values and observed setpoint errors across models or tasks; without this calibration the formal guarantees' practical relevance is difficult to assess.

    Authors: We appreciate this observation. The derived bounds explicitly depend on the linearization residual, and a side-by-side comparison with observed errors would better demonstrate the tightness and practical utility of the guarantees. In the revision we will include a new table (or supplementary figure) that computes the theoretical tracking-error bounds from the measured residuals for each model and task, then directly compares these predicted values against the empirical setpoint tracking errors reported in our experiments. This calibration will be accompanied by discussion of any observed discrepancies and their relation to the local-linearity assumption. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical linearity check + standard LQR + derived bounds remain independent of fitted inputs

full rationale

The derivation begins with an empirical claim that layer-wise transformer dynamics are locally linear (validated across models and scales), models inference as an LTV system, applies the classical LQR controller using computed Jacobians, and derives tracking-error bounds that explicitly depend on the linearization residual. None of these steps reduce by construction to quantities fitted inside the same experiment; the LQR formulation and error bounds are taken from external control theory, the linearity is presented as an empirical observation rather than a definitional assumption, and no self-citation chain is invoked to justify uniqueness or the ansatz. The central performance claims are therefore supported by external mathematics and separate empirical measurements rather than tautological renaming or self-referential fitting.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption of local linearity (empirically checked) and standard LQR theory; no new entities are postulated and only conventional weighting matrices are introduced.

free parameters (1)
  • LQR state and control weighting matrices Q and R
    Chosen to trade off setpoint tracking versus control effort; values are not derived from first principles and must be selected per task.
axioms (1)
  • domain assumption Layer-wise LLM dynamics admit a locally linear approximation whose Jacobians can be used for control
    Invoked to justify modeling the transformer as a linear time-varying system and applying LQR.
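The Q/R tradeoff flagged in the ledger has a standard control-theoretic reading, visible even in a scalar sketch (values below are illustrative): raising the control weight r shrinks the feedback gain, i.e. a heavier R buys gentler activation interventions at the cost of looser tracking.

```python
def scalar_lqr_gain(a, q, r, qT, T):
    """First-step LQR gain for x_{k+1} = a x_k + u_k via backward Riccati recursion."""
    p = qT
    for _ in range(T):
        k = p * a / (r + p)        # K = (R + P)^{-1} P A with B = 1
        p = q + a * p * a - a * p * k
    return k

a, qT, T = 1.05, 1.0, 20
k_cheap_control = scalar_lqr_gain(a, q=1.0, r=0.01, qT=qT, T=T)   # control nearly free
k_costly_control = scalar_lqr_gain(a, q=1.0, r=10.0, qT=qT, T=T)  # control expensive
print(k_cheap_control > k_costly_control)  # → True: heavier R means a smaller gain
```

This is why the referee's request for a Q/R sensitivity ablation matters: the same local-linear model yields very different intervention magnitudes depending on these free parameters.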

pith-pipeline@v0.9.0 · 5530 in / 1231 out tokens · 39851 ms · 2026-05-10T02:28:05.861216+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

106 extracted references · 62 canonical work pages · 11 internal anchors

  1. [1]

    Pawan , year=

    Webb, Stefan and Rainforth, Tom and Teh, Yee Whye and Kumar, M. Pawan , year=. A Statistical Approach to Assessing Neural Network Robustness , url=. doi:10.48550/arXiv.1811.07209 , abstractNote=

  2. [2]

    and Hilton, J

    Wu, Gabriel and Hilton, Jacob , year=. Estimating the Probabilities of Rare Outputs in Language Models , url=. doi:10.48550/arXiv.2410.13211 , abstractNote=

  3. [3]

    s-nlp/roberta\_toxicity\_classifier Model

    Roberta. s-nlp/roberta\_toxicity\_classifier Model. 2026

  4. [4]

    A Koopman framework for rare event simulation in stochastic differential equations , url=

    Zhang, Benjamin and Sahai, Tuhin and Marzouk, Youssef , year=. A Koopman framework for rare event simulation in stochastic differential equations , url=. doi:10.48550/arXiv.2101.07330 , abstractNote=

  5. [5]

    IEEE Robotics and Automation Letters , volume=

    Planning with learned dynamics: Probabilistic guarantees on safety and reachability via lipschitz constants , author=. IEEE Robotics and Automation Letters , volume=. 2021 , publisher=

  6. [6]

    and Kaiser,

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N. and Kaiser,. Attention is all you need , year =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

  7. [7]

    Jailbroken: How Does LLM Safety Training Fail?

    Wei, Alexander and Haghtalab, Nika and Steinhardt, Jacob , year=. Jailbroken: How Does LLM Safety Training Fail? , url=. doi:10.48550/arXiv.2307.02483 , abstractNote=

  8. [8]

    Machine Learning: Science and Technology , author=

    Neural scaling laws from large- N field theory: solvable model beyond the ridgeless limit , volume=. Machine Learning: Science and Technology , author=. 2025 , month=jun, pages=. doi:10.1088/2632-2153/adc872 , abstractNote=

  9. [9]

    Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures,

    Uesato, Jonathan and Kumar, Ananya and Szepesvari, Csaba and Erez, Tom and Ruderman, Avraham and Anderson, Keith and Krishmamurthy and Dvijotham and Heess, Nicolas and Kohli, Pushmeet , year=. Rigorous Agent Evaluation: An Adversarial Approach to Uncover Catastrophic Failures , url=. doi:10.48550/arXiv.1812.01647 , abstractNote=

  10. [10]

    Efficient Sampling Methods of, by, and for Stochastic Dynamical Systems , rights=

    Zhang, Benjamin Jiahong , year=. Efficient Sampling Methods of, by, and for Stochastic Dynamical Systems , rights=

  11. [11]

    Journal of Statistical Physics , author=

    Simulating Rare Events in Dynamical Processes , volume=. Journal of Statistical Physics , author=. 2011 , month=nov, pages=. doi:10.1007/s10955-011-0350-4 , abstractNote=

  12. [12]

    and Chen, Mo and Tomlin, Claire J

    Fisac, Jaime F. and Chen, Mo and Tomlin, Claire J. and Sastry, S. Shankar , year=. Reach-avoid problems with time-varying dynamics, targets and constraints , ISBN=. doi:10.1145/2728606.2728612 , booktitle=

  13. [13]

    Variational approach to rare event simulation using least-squares regression , url=

    Hartmann, Carsten and Kebiri, Omar and Neureither, Lara and Richter, Lorenz , year=. Variational approach to rare event simulation using least-squares regression , url=. doi:10.48550/arXiv.1901.09195 , abstractNote=

  14. [14]

    Journal of the Franklin Institute , author=

    An approximate method of stochastic terminal control for nonlinear dynamical systems , volume=. Journal of the Franklin Institute , author=. 1970 , month=mar, pages=. doi:10.1016/0016-0032(70)90286-3 , abstractNote=

  15. [15]

    Oppor- tunities and challenges in deep learning adversarial ro- bustness: A survey.arXiv preprint arXiv:2007.00753, 2020

    Silva, Samuel Henrique and Najafirad, Peyman , year=. Opportunities and Challenges in Deep Learning Adversarial Robustness: A Survey , url=. doi:10.48550/arXiv.2007.00753 , abstractNote=

  16. [16]

    and Schlichting, Marc R

    Kruse, Liam A. and Schlichting, Marc R. and Kochenderfer, Mykel J. , year=. Scalable Importance Sampling in High Dimensions with Low-Rank Mixture Proposals , url=. doi:10.48550/arXiv.2505.13335 , abstractNote=

  17. [17]

    Chaos: An Interdisciplinary Journal of Nonlinear Science , author=

    Rare Event Sampling Methods , volume=. Chaos: An Interdisciplinary Journal of Nonlinear Science , author=. 2019 , month=aug, pages=. doi:10.1063/1.5120509 , number=

  18. [18]

    Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems , url=

    Bengtsson, Thomas and Bickel, Peter and Li, Bo , year=. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems , url=. doi:10.48550/arXiv.0805.3034 , abstractNote=

  19. [19]

    BMC Systems Biology , author=

    Quantifying uncertainty, variability and likelihood for ordinary differential equation models , volume=. BMC Systems Biology , author=. 2010 , month=dec, pages=. doi:10.1186/1752-0509-4-144 , number=

  20. [20]

    Computational Doob h-transforms for Online Filtering of Discretely Observed Diffusions , url=

    Chopin, Nicolas , year=. Computational Doob h-transforms for Online Filtering of Discretely Observed Diffusions , url=

  21. [21]

    2019 , url =

    Simon Särkkä and Arno Solin , title =. 2019 , url =

  22. [22]

    Real-time optimal control of high-dimensional parametrized systems by deep learning-based reduced order models , url=

    Tomasetto, Matteo and Manzoni, Andrea and Braghin, Francesco , year=. Real-time optimal control of high-dimensional parametrized systems by deep learning-based reduced order models , url=. doi:10.48550/arXiv.2409.05709 , abstractNote=

  23. [23]

    IEEE Robotics and Automation Letters , author=

    Biased-MPPI: Informing Sampling-Based Model Predictive Control by Fusing Ancillary Controllers , volume=. IEEE Robotics and Automation Letters , author=. 2024 , month=jun, pages=. doi:10.1109/LRA.2024.3397083 , abstractNote=

  24. [24]

    arXiv preprint arXiv:2301.06227 , year=

    Wu, Guangyu and Lindquist, Anders , year=. General Distribution Steering: A Sub-Optimal Solution by Convex Optimization , url=. doi:10.48550/arXiv.2301.06227 , abstractNote=

  25. [25]

    Pilipovsky, Joshua and Sivaramakrishnan, Vignesh and Oishi, Meeko M. K. and Tsiotras, Panagiotis , year=. Probabilistic Verification of ReLU Neural Networks via Characteristic Functions , url=. doi:10.48550/arXiv.2212.01544 , abstractNote=

  26. [26]

    , Köhler , J

    Rapakoulias, George and Tsiotras, Panagiotis , year=. Discrete-Time Maximum Likelihood Neural Distribution Steering , rights=. doi:10.1109/CDC56724.2024.10885992 , booktitle=

  27. [27]

    arXiv preprint arXiv:2405.15454 , year=

    Cheng, Emily and Alonso, Carmen Amo , year=. Linearly Controlled Language Generation with Performative Guarantees , url=. doi:10.48550/arXiv.2405.15454 , abstractNote=

  28. [28]

    Steering Llama 2 via Contrastive Activation Addition

    Rimsky, Nina and Gabrieli, Nick and Schulz, Julian and Tong, Meg and Hubinger, Evan and Turner, Alexander , year=. Steering Llama 2 via Contrastive Activation Addition , url=. doi:10.18653/v1/2024.acl-long.828 , booktitle=

  29. [29]

    Primal-dual ilqr for gpu-accelerated learning and control in legged robots.arXiv preprint arXiv:2506.07823, 2025

    Amatucci, Lorenzo and Sousa-Pinto, João and Turrisi, Giulio and Orban, Dominique and Barasuol, Victor and Semini, Claudio , year=. Primal-Dual iLQR for GPU-Accelerated Learning and Control in Legged Robots , url=. doi:10.48550/arXiv.2506.07823 , abstractNote=

  30. [30]

    Mechanistic interpretability for steering vision-language-action models , url=

    Häon, Bear and Stocking, Kaylene and Chuang, Ian and Tomlin, Claire , year=. Mechanistic interpretability for steering vision-language-action models , url=. doi:10.48550/arXiv.2509.00328 , abstractNote=

  31. [31]

    Refusal in Language Models Is Mediated by a Single Direction

    Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel , year=. Refusal in Language Models Is Mediated by a Single Direction , url=. doi:10.48550/arXiv.2406.11717 , abstractNote=

  32. [32]

    Preemptive Detection and Steering of LLM Misalignment via Latent Reachability , url=

    Karnik, Sathwik and Bansal, Somil , year=. Preemptive Detection and Steering of LLM Misalignment via Latent Reachability , url=. doi:10.48550/arXiv.2509.21528 , abstractNote=

  33. [33]

    2013 IEEE International Conference on Robotics and Automation , author=

    Optimal sampling-based planning for linear-quadratic kinodynamic systems , ISSN=. 2013 IEEE International Conference on Robotics and Automation , author=. 2013 , month=may, pages=. doi:10.1109/ICRA.2013.6630907 , abstractNote=

  34. [34]

    and Vu, Hieu M

    Nguyen, Dung V. and Vu, Hieu M. and Pham, Nhi Y. and Zhang, Lei and Nguyen, Tan M. , year=. Activation Steering with a Feedback Controller , url=. doi:10.48550/arXiv.2510.04309 , abstractNote=

  35. [35]

    arXiv preprint arXiv:2407.15549 , year=

    Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and Lynch, Aengus and Wu, Cindy and Hebbar, Vivek and Sleight, Henry and Stickland, Asa Cooper and Perez, Ethan and Hadfield-Menell, Dylan and Casper, Stephen , year=. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs , url=. doi:10.48550/arXiv.2407.15549 , abstractNote=

  36. [36]

    Monnier, John D. and Jain, Prachet and Gutierrez, Mayra and Han, Chi and Hezi, Sara and Kalluri, Shashank and Kabaria, Hirsh and Kompas, Brennan and Harikumar, Vaishnavi and Skifstad, Julian and Peri, Janani and Hernandez, Emmanuel and Bhaskarapanthula, Ramya and Cutler, James , year=. Prospects for using drones to test formation-flying CubeSat concepts, ...

  37. [37]

    Explaining and Harnessing Adversarial Examples

    Goodfellow, Ian J. and Shlens, Jonathon and Szegedy, Christian , year=. Explaining and Harnessing Adversarial Examples , url=. doi:10.48550/arXiv.1412.6572 , abstractNote=

  38. [38]

    Propagation of Uncertainty with the Koopman Operator , url=

    Servadio, Simone and Lavezzi, Giovanni and Hofmann, Christian and Wu, Di and Linares, Richard , year=. Propagation of Uncertainty with the Koopman Operator , url=. doi:10.48550/arXiv.2407.20170 , abstractNote=

  39. [39]

    arXiv preprint arXiv:2410.23054 (2024)

    Rodriguez, Pau and Blaas, Arno and Klein, Michal and Zappella, Luca and Apostoloff, Nicholas and Cuturi, Marco and Suau, Xavier , year=. Controlling Language and Diffusion Models by Transporting Activations , url=. doi:10.48550/arXiv.2410.23054 , abstractNote=

  40. [40]

    European Symposium on Programming , pages=

    Neural network verification is a programming language challenge , author=. European Symposium on Programming , pages=. 2025 , organization=

  41. [41]

    Steering Language Models With Activation Engineering

    Turner, Alexander Matt and Thiergart, Lisa and Leech, Gavin and Udell, David and Vazquez, Juan J. and Mini, Ulisse and MacDiarmid, Monte , year=. Steering Language Models With Activation Engineering , url=. doi:10.48550/arXiv.2308.10248 , abstractNote=

  42. [42]

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , shorttitle =

    Li, Kenneth and Patel, Oam and Viégas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin , year=. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , url=. doi:10.48550/arXiv.2306.03341 , abstractNote=

  43. [43]

    A language model's guide through latent space

    Rütte, Dimitri von and Anagnostidis, Sotiris and Bachmann, Gregor and Hofmann, Thomas , year=. A Language Model’s Guide Through Latent Space , url=. doi:10.48550/arXiv.2402.14433 , abstractNote=

  44. [44]

    and Bewley, Tom and Mishra, Saumitra and Veloso, Manuela , title =

    Hedström, Anna and Amoukou, Salim I. and Bewley, Tom and Mishra, Saumitra and Veloso, Manuela , title =. Proceedings of the 42nd International Conference on Machine Learning , series =. 2025 , address =

  45. [45] Vu, H. M. and Nguyen, T. M. Angular Steering: Behavior Control via Rotation in Activation Space. arXiv:2510.26243, 2025.

  46. [46] Geiger, A., Wu, Z., Potts, C., Icard, T., and Goodman, N. D. Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. arXiv:2303.02536.

  47. [47] Safe Large-Scale Robust Nonlinear MPC in Milliseconds via Reachability-Constrained System Level Synthesis on the GPU. arXiv:2604.07644.

  48. [48] Zhao, H., Sun, H., Kong, J., Li, X., Wang, Q., Jiang, L., Zhu, Q., Abdelzaher, T. F., Choi, Y., Li, M., and Shao, H. 2026.

  49. [49] Low-Rank LQR Optimal Control Design over Wireless Communication Networks. arXiv:2301.13729.

  50. [50] Safety Beyond the Training Data: Robust Out-of-Distribution MPC via Conformalized System Level Synthesis. arXiv:2602.12047, 2026.

  51. [51] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy Models of Superposition.

  52. [52] Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., and Belinkov, Y. LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. arXiv:2410.02707.

  53. [53] Lewis, F. L., Vrabie, D. L., and Syrmos, V. L.

  54. [54] Gemma Team. Gemma. doi:10.34740/KAGGLE/M/3301.

  55. [55] Qwen Team. Qwen2.5: A Party of Foundation Models.

  56. [56] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. arXiv:2009.11462.

  57. [57] Kong, L., Wang, H., Mu, W., Du, Y., Zhuang, Y., Zhou, Y., Song, Y., Zhang, R., Wang, K., and Zhang, C. Aligning Large Language Models with Representation Editing: A Control Perspective. arXiv:2406.05954.

  58. [58] Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. arXiv:2308.05374.

  59. [59] Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems.

  60. [60] Training Language Models to Follow Instructions with Human Feedback. In Advances in Neural Information Processing Systems.

  61. [61] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. In Advances in Neural Information Processing Systems.

  62. [62] Parameter-Efficient Transfer Learning for NLP. In International Conference on Machine Learning, 2019.

  63. [63] ARGS: Alignment as Reward-Guided Search. arXiv:2402.01694.

  64. [64] A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861.

  65. [65] The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. arXiv:2312.01552.

  66. [66] Statistical Safety and Robustness Guarantees for Feedback Motion Planning of Unknown Underactuated Stochastic Systems. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023.

  67. [67] Bereska, L. and Gavves, S. Mechanistic Interpretability for … 2024.

  68. [68] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405.

  69. [69] Taming AI Bots: Controllability of Neural States in Large Language Models. arXiv:2305.18449.

  70. [70] Optimal Control. 2012.

  71. [71] Luo, Y., An, R., Zou, B., Tang, Y., Liu, J., and Zhang, S. Prompt Engineering Through the Lens of Optimal Control. arXiv:2310.14201.

  72. [72] Contributions to the Theory of Optimal Control. Bol. Soc. Mat. Mexicana.

  73. [73] What's the Magic Word? A Control Theory of LLM Prompting. arXiv:2310.04444.

  74. [74] Analysing the Generalisation and Reliability of Steering Vectors. In Advances in Neural Information Processing Systems.

  75. [75] Multi-Property Steering of Large Language Models with Dynamic Activation Composition. arXiv:2406.17563.

  76. [76] CBF-LLM: Safe Control for LLM Alignment. arXiv:2408.15625.

  77. [77] Learning to Summarize with Human Feedback. In Advances in Neural Information Processing Systems.

  78. [78] LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.

  79. [79] Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation. arXiv:2401.08417, 2024.

  80. [80] Preference Ranking Optimization for Human Alignment. In Proceedings of the AAAI Conference on Artificial Intelligence.

Showing first 80 references.