Steered LLM Activations are Non-Surjective
Pith reviewed 2026-05-11 00:55 UTC · model grok-4.3
The pith
Activation steering in LLMs moves residual stream states off the manifold reachable by any prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under practical assumptions, activation steering is non-surjective: the steered residual-stream activation lies outside the image of the forward pass from discrete prompts, so almost surely no prompt can reproduce the same internal state.
What carries the argument
The manifold of residual-stream activations reachable from sequences of discrete tokens; additive steering displaces the activation vector off this manifold.
Load-bearing premise
The practical assumptions about residual-stream geometry and additive steering that make the set of prompt-reachable states a lower-dimensional submanifold of the full activation space.
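The measure-zero step that this premise feeds can be written out as a short sketch. The notation below ($\mathcal{V}$, $f$, $\mathcal{R}$, $\alpha$, $v$) is assumed here for illustration and is not taken from the paper, whose exact assumptions are not listed on this page.

```latex
% Assumed notation: \mathcal{V} is the finite vocabulary, \mathcal{V}^{*} the set
% of finite token sequences, and f maps a prompt to its residual-stream
% activation at a fixed layer and position.
\[
  f : \mathcal{V}^{*} \to \mathbb{R}^{d},
  \qquad
  \mathcal{R} = f(\mathcal{V}^{*}) \quad \text{(prompt-reachable set)} .
\]
% Additive steering of a reachable state h along a direction v with scale \alpha:
\[
  \tilde{h} = h + \alpha v, \qquad h \in \mathcal{R}, \quad \alpha \neq 0 .
\]
% If \mathcal{R} is Lebesgue-null (for example, a submanifold of positive
% codimension) and v is drawn from a distribution with a density, then
\[
  \Pr\bigl[\tilde{h} \in \mathcal{R}\bigr]
  = \Pr\bigl[v \in \alpha^{-1}(\mathcal{R} - h)\bigr]
  = 0 ,
\]
% because translating and rescaling a null set yields a null set.
```

Read this way, everything hinges on $\mathcal{R}$ being null, which is exactly what the premise above asserts.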
What would settle it
An explicit prompt whose residual-stream activation vector exactly equals a given steered vector, or a demonstration that the reachable set is the entire space under the model's actual forward pass.
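The first settling condition can be phrased as a search problem. The following is a minimal toy sketch in Python, not the paper's experiment: `toy_forward`, the sizes `VOCAB`, `DIM`, `MAX_LEN`, and the steering scale are all hypothetical stand-ins chosen for illustration. It enumerates every prompt up to a short length and reports how close any of them comes to a steered activation.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, MAX_LEN = 5, 8, 3  # toy sizes (hypothetical)

E = rng.normal(size=(VOCAB, DIM))               # toy token embeddings
W = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)  # toy mixing weights


def toy_forward(tokens):
    """Toy stand-in for a forward pass: mean-pooled embeddings plus one tanh residual update."""
    h = E[list(tokens)].mean(axis=0)
    return h + np.tanh(h @ W)


# Enumerate every prompt of length 1..MAX_LEN and record its activation.
prompts = [
    p
    for n in range(1, MAX_LEN + 1)
    for p in itertools.product(range(VOCAB), repeat=n)
]
reachable = np.stack([toy_forward(p) for p in prompts])

# Steer one reachable activation along a random direction (scale is arbitrary).
base = toy_forward((0, 1, 2))
steered = base + 4.0 * rng.normal(size=DIM)

# Distance from the steered point to every enumerated prompt activation.
dists = np.linalg.norm(reachable - steered, axis=1)
min_dist = float(dists.min())
nearest = prompts[int(dists.argmin())]

print(f"searched {len(prompts)} prompts; nearest {nearest} at distance {min_dist:.4f}")
```

A distance of exactly zero would be a counterexample of the kind described above. A positive distance is only consistent with the claim, since a bounded search over short prompts cannot rule out a longer prompt that matches.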
Original abstract
Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that activation steering in LLMs is non-surjective: under practical assumptions, additive steering displaces the residual stream off the manifold of activations reachable from any discrete token sequence. It proves that almost surely no prompt can reproduce the internal state induced by steering, using a mathematical analysis of manifolds and the forward pass, and illustrates the finding with empirical checks on three widely used LLMs. The authors conclude that this creates a formal separation between white-box steering and black-box prompting, cautioning against equating steering success with prompt-based interpretability or vulnerability.
Significance. If the central non-surjectivity result holds, it is significant for interpretability and safety research because it supplies a theoretical reason why steered behaviors may not be realizable by any natural-language prompt. The combination of a manifold-theoretic argument with multi-model empirical illustrations provides a concrete basis for decoupling white-box and black-box interventions, which could influence evaluation protocols in probing, safety, and mechanistic interpretability.
major comments (2)
- [Abstract / proof] Abstract and proof section: the central claim rests on 'practical assumptions' about the residual-stream manifold and forward-pass properties, yet these assumptions are never stated explicitly (e.g., whether the reachable set has positive codimension, whether the forward map is analytic or merely continuous, or whether token embeddings are in general position). Without this list the measure-zero argument cannot be verified for real transformers.
- [Empirical illustration] Empirical section: the experiments on three LLMs show statistical differences between steered and prompt-induced activations, but finite sampling cannot establish the 'almost surely' non-existence result; the paper acknowledges this limitation yet still presents the empirical work as supporting the almost-sure claim.
minor comments (1)
- [Throughout] The phrase 'manifold of states reachable from discrete prompts' is used repeatedly without a formal definition or pointer to prior literature on activation geometry in transformers; a short definitional paragraph would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our assumptions and the scope of the empirical results. We address each point below and will incorporate revisions to improve verifiability and framing.
Point-by-point responses
- Referee: [Abstract / proof] Abstract and proof section: the central claim rests on 'practical assumptions' about the residual-stream manifold and forward-pass properties, yet these assumptions are never stated explicitly (e.g., whether the reachable set has positive codimension, whether the forward map is analytic or merely continuous, or whether token embeddings are in general position). Without this list the measure-zero argument cannot be verified for real transformers.
Authors: We agree that the assumptions underlying the measure-zero argument should be stated explicitly to enable verification. In the revised manuscript we will add a dedicated paragraph in the proof section that enumerates them: (1) the reachable activation set is a lower-dimensional submanifold of positive codimension in the residual stream; (2) the forward-pass map is continuous (and analytic on the interior of its domain); and (3) token embeddings are in general position so that their linear combinations do not fill the ambient space. These clarifications will make the application of the measure-zero result transparent for real transformers.
Revision: yes
- Referee: [Empirical illustration] Empirical section: the experiments on three LLMs show statistical differences between steered and prompt-induced activations, but finite sampling cannot establish the 'almost surely' non-existence result; the paper acknowledges this limitation yet still presents the empirical work as supporting the almost-sure claim.
Authors: We concur that finite sampling supplies only illustrative evidence and cannot prove the almost-sure claim, which rests on the theoretical argument. The empirical checks are intended to demonstrate that the predicted statistical separation is observable on standard models. We will revise the text to frame the experiments explicitly as supportive illustrations, strengthen the limitations discussion, and avoid any phrasing that could be read as treating the empirical results as confirmatory of the measure-zero statement.
Revision: yes
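Assumption (1) in the authors' first response, that the reachable set has positive codimension, suggests a simple numerical diagnostic. The sketch below is an illustration under strong simplifications, not the authors' procedure: it treats the reachable set as an affine subspace (a linear special case of a positive-codimension set), and the helper `span_rank_and_residual` and the sizes `DIM`, `LATENT`, `N` are hypothetical. It estimates the numerical rank of a sample of activations and measures how far a steered point sits from their span.

```python
import numpy as np


def span_rank_and_residual(acts, x, rel_tol=1e-8):
    """Return (numerical rank of the centered sample, distance from x to its affine span)."""
    center = acts.mean(axis=0)
    _, s, vt = np.linalg.svd(acts - center, full_matrices=False)
    rank = int(np.sum(s > rel_tol * s[0]))  # count non-negligible singular values
    basis = vt[:rank]                       # orthonormal rows spanning the sample
    offset = x - center
    residual = offset - basis.T @ (basis @ offset)  # component outside the span
    return rank, float(np.linalg.norm(residual))


rng = np.random.default_rng(0)
DIM, LATENT, N = 16, 3, 200  # toy sizes (hypothetical)

# Synthetic "prompt" activations confined to a LATENT-dimensional subspace of R^DIM.
acts = rng.normal(size=(N, LATENT)) @ rng.normal(size=(LATENT, DIM))

# Additive steering with a generic direction drawn from the full ambient space.
steered = acts[0] + rng.normal(size=DIM)

rank, resid_steered = span_rank_and_residual(acts, steered)
_, resid_prompt = span_rank_and_residual(acts, acts[0])

print(f"numerical rank {rank} of {DIM}")
print(f"residual of unsteered point: {resid_prompt:.2e}")
print(f"residual of steered point:   {resid_steered:.2e}")
```

As the second response concedes for the paper's own experiments, a diagnostic like this on a finite sample can only illustrate the separation. Real activations need not lie on a linear subspace, and a numerical rank estimate is not a proof of codimension.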
Circularity Check
No circularity: non-surjectivity follows from manifold analysis under stated assumptions
The central result is a mathematical proof that additive steering maps the residual stream outside the image of the discrete-prompt forward pass, under practical assumptions on the residual-stream manifold and forward-pass properties. This is not obtained by fitting parameters to data, renaming an empirical pattern, or reducing to a self-citation chain; the proof is self-contained once the assumptions are granted. The three-LLM empirical illustrations are presented separately as corroboration and do not enter the derivation. No load-bearing step equates the claimed non-surjectivity to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: practical assumptions on the LLM residual stream and the steering operation
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear): "Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering."
Forward citations
Cited by 1 Pith paper
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.