FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers
Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3
The pith
The pullback Fisher metric provides a closed-form optimal direction for steering activations in transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the pullback Fisher metric yields a closed-form steering equation that identifies, at each point, the minimum-distortion direction for any target concept. The equation can be applied iteratively and requires no manifold fitting or data-driven geometry estimation. The framework, called FishBack, also shows that existing methods implicitly use different approximate metrics, and that their relative performance is predicted by a spectral diagnostic: the ratio of each method's implicit-metric cost to the Fisher-optimal cost.
What carries the argument
The pullback Fisher metric obtained by pulling the softmax layer's Fisher information back through the Jacobians of subsequent layers, which defines the geometry used to find minimum-distortion steering directions.
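As a rough illustration of this construction, the sketch below builds the softmax Fisher matrix F = diag(p) - p p^T and pulls it back through a Jacobian to get G = J^T F J. The 3-by-2 matrix J, the logits, and all helper names are hypothetical stand-ins rather than values or code from the paper, and only the Python standard library is used.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def softmax_fisher(p):
    """F = diag(p) - p p^T for a categorical distribution in logit coordinates."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def pullback_metric(J, F):
    """G = J^T F J, with J of shape (num_logits, hidden_dim)."""
    return matmul(transpose(J), matmul(F, J))

# Hypothetical toy sizes: 3 logits, 2 hidden units. J stands in for the
# Jacobian of the layers that follow the steering point.
J = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
p = softmax([0.5, -0.2, 0.1])
F = softmax_fisher(p)
G = pullback_metric(J, F)

# Each row of F sums to zero, so F sends the all-ones vector to zero.
F_times_ones = [sum(row) for row in F]
print("F @ ones =", F_times_ones)
print("G =", G)
```

For a real transformer one would presumably avoid materializing J and work with Jacobian-vector products instead; the dense version here is only meant to make the shapes and the symmetry of G visible.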
If this is right
- Existing methods such as CAA, ActAdd, and ITI each implicitly adopt a particular approximate metric.
- Their performance gaps are quantitatively predicted by the ratio of their implicit metric's cost to the Fisher-optimal cost.
- Iterative pullback steering outperforms all Euclidean baselines across three verb-morphology concepts and four layers on GPT-2.
- Off-target KL reductions reach 1.3x to 2.5x relative to Euclidean gradient ascent.
Where Pith is reading between the lines
- The layer-wise recursive decomposition suggests the metric can be computed efficiently even in deeper transformer stacks without full Jacobian materialization.
- The low effective dimensionality implies that steering success depends on aligning directions with a small number of dominant modes in the pulled-back geometry.
Load-bearing premise
The local geometry relevant for activation steering is accurately captured by the Fisher information metric of the softmax layer pulled back through the Jacobian of subsequent layers.
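The premise leans on a standard fact: for small logit perturbations, the KL divergence between the original and perturbed softmax distributions is approximated to second order by the Fisher quadratic form. A minimal numerical check of that fact, using hypothetical logits and a hypothetical perturbation direction, could look like this:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def kl(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher_quadratic(p, d):
    """d^T (diag(p) - p p^T) d, which equals the variance of d under p."""
    mean = sum(pi * di for pi, di in zip(p, d))
    return sum(pi * di * di for pi, di in zip(p, d)) - mean * mean

z = [1.0, 0.0, -1.0, 0.5]            # hypothetical logits
direction = [0.3, -0.1, 0.2, -0.4]   # hypothetical perturbation direction
p = softmax(z)

for eps in (1e-1, 1e-2, 1e-3):
    d = [eps * x for x in direction]
    exact = kl(p, softmax([zi + di for zi, di in zip(z, d)]))
    approx = 0.5 * fisher_quadratic(p, d)
    print(f"eps={eps:g}  exact KL={exact:.3e}  quadratic={approx:.3e}")
```

The two columns should agree more closely as eps shrinks. Whether the same local approximation remains adequate at the step sizes used for steering in GPT-2 is exactly what the premise assumes.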
What would settle it
Applying FishBack steering to a new concept and checking whether the resulting change in output distribution matches the predicted minimum-distortion path better than Euclidean methods on GPT-2 layers.
Original abstract
Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods -- CAA, ActAdd, ITI, and others -- each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric's cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of $1.3\times$--$2.5\times$ relative to Euclidean gradient ascent and $1.5\times$ relative to CAA at matched concept probability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FishBack, a framework for activation steering in transformers that replaces the Euclidean assumption on activation spaces with the pullback of the softmax-layer Fisher information metric through the Jacobian of subsequent layers. It reports that this geometry deviates from Euclidean by over 97% in relative spectral norm on GPT-2 with effective dimensionality 2-17% of ambient space. From the pullback metric the authors derive a closed-form minimum-distortion steering direction for any target concept that can be applied iteratively without manifold fitting. A layer-wise recursive decomposition is used to interpret existing methods (CAA, ActAdd, ITI) as implicit approximations to this metric, with their performance gaps predicted by a single spectral ratio of implicit to Fisher-optimal cost. Experiments on GPT-2 for three verb-morphology concepts across four layers show 1.3-2.5x off-target KL reduction relative to Euclidean gradient ascent and 1.5x relative to CAA at matched concept probability.
Significance. If the central derivations hold, the work supplies a principled information-geometric foundation for activation steering that directly incorporates the model's output distribution, potentially unifying and improving heuristic methods while reducing off-target effects. The closed-form character and recursive decomposition are notable strengths that could guide more reliable concept editing in large language models.
major comments (2)
- [Abstract] Abstract: the claim of a 'closed-form steering equation' yielding a unique 'optimal direction at each point' is undermined by the singularity of the pullback metric. The softmax Fisher matrix F = diag(p) - p p^T has a one-dimensional kernel spanned by the all-ones vector, so G(a) = J(a)^T F(p(a)) J(a) is positive semi-definite with a non-trivial null space. The minimum-distortion problem arg min_v v^T G v subject to a linear concept constraint therefore requires either the Moore-Penrose pseudo-inverse G^+ or explicit regularization; neither is mentioned in the abstract's description of the steering equation.
- [Section on spectral diagnostic] Section describing the spectral diagnostic: the ratio of an implicit metric's cost to the Fisher-optimal cost is defined directly from the same quantity used to declare optimality and to evaluate empirical superiority. This construction risks making the 'quantitative prediction' of performance gaps partly tautological rather than an independent diagnostic, weakening the cross-method comparison claim.
minor comments (2)
- The abstract states '97% spectral deviation' and 'effective dimensionality of only 2--17%' without defining the precise norm, baseline Euclidean metric, or the method used to compute effective dimension; a short clarifying sentence or reference to the relevant equation would improve readability.
- All acronyms (CAA, ActAdd, ITI) should be expanded on first use in the main text even if defined in the abstract.
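To make the first minor comment concrete: "effective dimensionality" has several definitions in use, and the number reported depends on which one is chosen. The sketch below shows one common convention, the participation ratio of the metric's eigenvalues. This is an assumption for illustration only; the summary above does not say which definition the paper uses, and the spectrum is hypothetical.

```python
def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals k when k eigenvalues are equal and the rest are zero.
    """
    s1 = sum(eigenvalues)
    s2 = sum(lam * lam for lam in eigenvalues)
    return s1 * s1 / s2

# Hypothetical spectrum of a metric on a 10-dimensional activation space.
spectrum = [4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
d_eff = participation_ratio(spectrum)
fraction = d_eff / len(spectrum)
print(f"effective dimension = {d_eff:.2f} ({100 * fraction:.0f}% of ambient)")
```

Other conventions, such as counting eigenvalues above a threshold or exponentiating the spectral entropy, can give different percentages for the same spectrum, which is why the referee asks for the definition to be stated.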
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our geometric framework. We address each major point below and indicate revisions where appropriate to strengthen the manuscript.
- Referee: [Abstract] Abstract: the claim of a 'closed-form steering equation' yielding a unique 'optimal direction at each point' is undermined by the singularity of the pullback metric. The softmax Fisher matrix F = diag(p) - p p^T has a one-dimensional kernel spanned by the all-ones vector, so G(a) = J(a)^T F(p(a)) J(a) is positive semi-definite with a non-trivial null space. The minimum-distortion problem arg min_v v^T G v subject to a linear concept constraint therefore requires either the Moore-Penrose pseudo-inverse G^+ or explicit regularization; neither is mentioned in the abstract's description of the steering equation.
  Authors: We agree that the abstract should explicitly note the handling of the pullback metric's semi-definiteness. The full derivation in Section 3 uses the Moore-Penrose pseudo-inverse G^+ to obtain the minimum-distortion direction under the linear concept constraint. This direction is well-defined on the range of G and is the unique solution in the orthogonal complement of the kernel. We will revise the abstract to state that the closed-form steering equation employs the pseudo-inverse of the pullback Fisher metric.
  Revision: yes
- Referee: [Section on spectral diagnostic] Section describing the spectral diagnostic: the ratio of an implicit metric's cost to the Fisher-optimal cost is defined directly from the same quantity used to declare optimality and to evaluate empirical superiority. This construction risks making the 'quantitative prediction' of performance gaps partly tautological rather than an independent diagnostic, weakening the cross-method comparison claim.
  Authors: The spectral ratio is computed a priori from the eigenvalues of the implicit versus Fisher metrics alone, without reference to task-specific performance data. It quantifies the relative distortion cost of each method's implicit geometry and is used to predict the ordering of empirical gaps before any steering experiments are run. We will add explicit language in the relevant section clarifying this separation and noting that the subsequent correlation with observed KL reductions serves as empirical validation rather than part of the definition.
  Revision: partial
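The second response above turns on the ratio being computable from the metrics alone. A minimal sketch of such an a priori computation follows, under a strong simplifying assumption that is not taken from the paper: the implicit metric and the Fisher metric are both diagonal in a shared basis and strictly positive. The function names, spectra, and concept gradient q are all hypothetical.

```python
def steer(metric_diag, q, rho):
    """Minimum-cost direction under a diagonal positive metric M subject to
    q^T delta = rho:  delta = rho * M^{-1} q / (q^T M^{-1} q)."""
    minv_q = [qi / mi for qi, mi in zip(q, metric_diag)]
    scale = rho / sum(qi * v for qi, v in zip(q, minv_q))
    return [scale * v for v in minv_q]

def cost(g_diag, delta):
    """Distortion cost 0.5 * delta^T G delta for a diagonal G."""
    return 0.5 * sum(g * d * d for g, d in zip(g_diag, delta))

def cost_ratio(implicit_diag, fisher_diag, q, rho=1.0):
    """Cost of the implicit-metric direction divided by the Fisher-optimal
    cost, with both costs measured under the Fisher metric."""
    implicit_cost = cost(fisher_diag, steer(implicit_diag, q, rho))
    optimal_cost = cost(fisher_diag, steer(fisher_diag, q, rho))
    return implicit_cost / optimal_cost

fisher_diag = [9.0, 1.0, 0.25]     # hypothetical eigenvalues of G
euclidean_diag = [1.0, 1.0, 1.0]   # identity metric: plain Euclidean steering
q = [1.0, 1.0, 1.0]                # hypothetical concept gradient

ratio_euclid = cost_ratio(euclidean_diag, fisher_diag, q)
ratio_fisher = cost_ratio(fisher_diag, fisher_diag, q)
print(f"Euclidean / Fisher-optimal cost ratio: {ratio_euclid:.3f}")
print(f"Fisher / Fisher-optimal cost ratio:    {ratio_fisher:.3f}")
```

Nothing in this computation touches steering outcomes, which matches the authors' point. The referee's concern still stands in a weaker form: the cost is measured under the same metric that defines optimality, so the ratio can never fall below 1 by construction.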
Circularity Check
No significant circularity; derivation is self-contained from standard pullback construction
The paper begins with the standard Fisher information matrix at the softmax layer and pulls it back via the Jacobian of subsequent layers to obtain the activation-space metric G(a) = J^T F J. This is a direct, first-principles definition from information geometry and is not defined in terms of any steering outcome or performance gap. The closed-form optimal direction is obtained by solving the quadratic minimization problem induced by this metric under a linear concept constraint, which is a standard Lagrange-multiplier or pseudo-inverse step and does not presuppose the final steering vector. The layer-wise decomposition and the spectral diagnostic (ratio of implicit-metric cost to Fisher-optimal cost) are post-hoc explanatory devices that rank existing Euclidean methods; the reported performance advantages are measured by independent quantities (concept probability and off-target KL divergence) rather than by the diagnostic itself. No load-bearing step reduces to a self-citation, fitted input renamed as prediction, or ansatz smuggled from prior work. The derivation chain therefore remains independent of its claimed outputs.
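The pseudo-inverse step described above can be sketched in a few lines by working in the eigenbasis of G, where the Moore-Penrose pseudo-inverse is an elementwise reciprocal of the non-zero eigenvalues. The spectrum, the concept gradient q, the target change rho, and the function names are hypothetical. The sketch also checks the excess-cost identity C_G(d~) - C_G(d*) = 0.5 (d~ - d*)^T G (d~ - d*) for a second direction that meets the same constraint.

```python
def pinv_diag(eigs, tol=1e-12):
    """Pseudo-inverse of a diagonal PSD matrix: invert non-zero entries only."""
    return [1.0 / lam if lam > tol else 0.0 for lam in eigs]

def fisher_optimal(eigs, q, rho):
    """delta* = rho * G^+ q / (q^T G^+ q), written in the eigenbasis of G."""
    gq = [g * qi for g, qi in zip(pinv_diag(eigs), q)]
    denom = sum(qi * v for qi, v in zip(q, gq))
    return [rho * v / denom for v in gq]

def cost(eigs, delta):
    """0.5 * delta^T G delta for diagonal G."""
    return 0.5 * sum(lam * d * d for lam, d in zip(eigs, delta))

eigs = [2.0, 0.5, 0.0]   # hypothetical spectrum; the zero models the kernel of G
q = [1.0, 1.0, 0.0]      # hypothetical concept gradient, chosen inside range(G)
rho = 0.3                # hypothetical target change in the concept score

delta_star = fisher_optimal(eigs, q, rho)

# Another direction with the same constraint q^T delta = rho: the Euclidean one.
q_norm_sq = sum(x * x for x in q)
delta_tilde = [rho * qi / q_norm_sq for qi in q]

diff = [a - b for a, b in zip(delta_tilde, delta_star)]
excess = cost(eigs, delta_tilde) - cost(eigs, delta_star)
pythagorean = cost(eigs, diff)

print("delta* =", delta_star)
print("constraint value =", sum(qi * d for qi, d in zip(q, delta_star)))
print("excess cost =", excess, " 0.5*diff^T G diff =", pythagorean)
```

The gradient q is placed inside the range of G on purpose. If q had a component in the kernel, the constraint could be satisfied at zero cost by moving along the kernel, and the pseudo-inverse solution would be the minimizer only among directions in the range of G.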
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The local geometry induced by the model's output behavior is given by the Fisher information metric of the softmax layer pulled back through the Jacobian of subsequent layers.
invented entities (1)
- FishBack framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: $D_{\mathrm{KL}}\left(P_{\lambda_0} \,\|\, P_{f(h_0+\delta h)}\right) = \tfrac{1}{2}\, \delta h^\top (J^\top H J)\, \delta h + O(\|\delta h\|^3)$; $\delta h^* = \rho\, G^{+} q / (q^\top G^{+} q)$
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Fisher-Pythagorean excess cost identity: $C_G(\tilde{\delta}) - C_G(\delta^*_G) = \tfrac{1}{2}\, (\tilde{\delta} - \delta^*_G)^\top G\, (\tilde{\delta} - \delta^*_G)$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.