pith. machine review for the scientific record.

arxiv: 2605.03160 · v1 · submitted 2026-05-04 · 💻 cs.LG

Recognition: 2 theorem links


Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

Birk Sebastian Frostelid Torpmann-Hagen, Michael A. Riegler

Pith reviewed 2026-05-08 18:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords: sparse autoencoders · interpretability · feature steering · causal axes · language models · joint effects · pairwise protocol

The pith

Single-feature inspection of sparse autoencoder features mislabels the causal axes they steer in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the standard SAE interpretability method of labeling features from top contexts and testing them one at a time fails to capture how those features actually influence model outputs when they interact. Using a pairwise matrix protocol that jointly varies steering coefficients across multiple features, it identifies cases where three near-orthogonal features each produce a philosophy-of-mind register alone but together damage grounded composition on recipes, engine explanations, and introspective prompts. It also shows an inverted-U response in a self-disclaimer feature and distinct output regimes under matched-geometry single, joint, and random perturbations. These results indicate that current single-feature protocols can assign incorrect causal roles to SAE features. If correct, the findings mean many existing feature interpretations and steering applications rest on incomplete evidence.

Core claim

The paper claims that single-feature inspection mislabels causal axes: joint suppression of three near-orthogonal cluster-specific features at c=-500 damages grounded composition on recipes and engine explanations as well as introspective prompts, while single-feature suppression at the same magnitude leaves controls intact. A feature labeled "AI self-disclaimer" produces an inverted U-shape under a coefficient sweep, substituting a contemplative-philosopher voice at c=+500. Matched-geometry comparisons of single-feature, joint, and random-direction perturbations (norm ~1.55, cosine ~0.64) produce three distinct output regimes, with coherence loss depending on direction pattern rather than perturbation magnitude.

What carries the argument

The pairwise matrix protocol, which co-varies the steering coefficient with the joint condition across multiple features to expose interaction effects missed by one-at-a-time tests.
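As a concrete sketch, the condition grid such a protocol evaluates might be enumerated as below. The feature ids come from the paper's Gemma figure and the coefficient values from its reported sweeps; the function names and the exact grid construction are illustrative assumptions, since the abstract does not formalize the protocol.

```python
from itertools import product

# Illustrative sketch of the pairwise matrix protocol's condition grid.
# Feature ids are the Gemma cluster from Figure 2; coefficient values and
# function names are assumptions, not the authors' construction.
FEATURES = [3997, 13700, 11444]        # cluster-specific SAE features
COEFFS = [-500, -200, 0, 200, 500]     # steering-coefficient sweep

def conditions(features):
    """One-at-a-time conditions plus the joint condition."""
    singles = [(f,) for f in features]
    return singles + [tuple(features)]

def pairwise_matrix(features, coeffs):
    """Every (condition, coefficient) cell the protocol evaluates, so that
    single and joint effects are compared at matched coefficients."""
    return [(cond, c) for cond, c in product(conditions(features), coeffs)]

grid = pairwise_matrix(FEATURES, COEFFS)   # 4 conditions x 5 coefficients
```

Co-varying the condition with the coefficient is what exposes the joint cell at c=-500 that a one-at-a-time sweep never visits.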

If this is right

  • Joint suppression can produce placeholder text and coherence loss where single-feature suppression substitutes only strategy filler.
  • The protocol identifies a top causally responsible feature in Llama-3.1-8B-Instruct in addition to the Qwen and Gemma results.
  • Coherence loss under steering is direction-pattern-dependent rather than magnitude-dependent.
  • All three findings reproduce on Gemma-2-2B-it with model-specific damage signatures and CI separation of ~10x on the matched-geometry control.
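The matched-geometry control in the last bullet can be sketched as sampling a random direction constrained to the same norm and the same cosine to the joint direction. This is a hedged reconstruction assuming an orthogonal-decomposition construction; the paper's actual procedure is not given in the abstract.

```python
import math
import random

def matched_geometry_control(v, target_norm, target_cos, rng):
    """Sample a random direction with a fixed norm and fixed cosine to v.

    Illustrative reconstruction of a matched-geometry control: split a
    Gaussian draw into components parallel and orthogonal to v, then mix
    them to hit the target cosine and rescale to the target norm.
    """
    nv = math.sqrt(sum(x * x for x in v))
    vhat = [x / nv for x in v]
    r = [rng.gauss(0.0, 1.0) for _ in v]
    proj = sum(a * b for a, b in zip(r, vhat))
    u = [a - proj * b for a, b in zip(r, vhat)]   # component orthogonal to v
    nu = math.sqrt(sum(x * x for x in u))
    uhat = [x / nu for x in u]
    sin = math.sqrt(1.0 - target_cos ** 2)
    return [target_norm * (target_cos * a + sin * b)
            for a, b in zip(vhat, uhat)]

rng = random.Random(0)
joint = [rng.gauss(0.0, 1.0) for _ in range(64)]         # stand-in joint direction
ctrl = matched_geometry_control(joint, 1.55, 0.64, rng)  # norm ~1.55, cosine ~0.64
```

A control built this way matches first-order geometry exactly, which is precisely why the referee's point about higher-order statistics still bites.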

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be extended to test whether other common SAE features in safety or capability editing also show hidden joint effects that single steering misses.
  • It suggests that feature editing pipelines may need to include pairwise or higher-order checks to avoid side effects on unrelated tasks.
  • Neighbouring interpretability methods that rely on feature isolation might similarly benefit from systematic joint perturbation tests.

Load-bearing premise

That observed differences in output regimes between single-feature, joint, and random-direction perturbations of matched norm and cosine similarity are caused by mislabeled causal axes rather than other uncontrolled factors in the steering process or model-specific artifacts.

What would settle it

If joint suppression of the three features at c=-500 produces no greater damage to grounded composition than single-feature suppression of matched magnitude, or if the three perturbation regimes collapse to identical outputs under the matched-geometry controls.

Figures

Figures reproduced from arXiv: 2605.03160 by Birk Sebastian Frostelid Torpmann-Hagen, Michael A. Riegler.

Figure 1: Three orthogonal probes of one SAE-feature matrix on Qwen3-1.7B.
Figure 2: Gemma matrix-level replication. (a) Control-prompt regex-degen rate vs. steering coefficient for joint {#3997, #13700, #11444} vs. single #3997. At c=+200, joint amplification damages controls at a 58% rate where single steering at the same scalar coefficient leaves them at 0%. (b) Matched-geometry random-direction control: random at c=+345 matches joint c=+200 on perturbation magnitude via joint sum-norm 1.724; Wilson 9…
Original abstract

The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient with joint condition, and report three findings the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it. First, a feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer. Two further features anchor the criterion (one monotonic, one pure breakdown). Second, three near-orthogonal cluster-specific features that individually steer a philosophy-of-mind register, jointly suppressed at c=-500, damage grounded composition on recipes and engine explanations as well as introspective prompts; single-feature suppression at the same magnitude leaves controls intact. Third, a matched-geometry comparison of single-feature, joint, and random-direction perturbations (norm ~1.55, cosine ~0.64) yields three distinct output regimes: single-feature substitutes strategy filler, random direction substitutes diverse content, joint suppression alone produces placeholder text. Coherence loss is direction-pattern-dependent, not magnitude-dependent. All three findings reproduce on Gemma with model-specific damage signatures; the matched-geometry control is CI-separated by ~10x. The pipeline also locates a top causally responsible feature in Llama-3.1-8B-Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a pairwise matrix protocol for sparse autoencoders that co-varies steering coefficients with joint feature conditions. It claims that single-feature inspection mislabels causal axes, as shown by three findings replicated on Qwen3-1.7B-Instruct and Gemma-2-2B-it: an inverted U-shape under coefficient sweep for a feature labeled 'AI self-disclaimer'; joint suppression at c=-500 of three near-orthogonal cluster-specific features damaging grounded composition on recipes, engine explanations, and introspective prompts while single-feature suppression at the same magnitude leaves controls intact; and distinct output regimes (single-feature substitutes strategy filler, random direction substitutes diverse content, joint produces placeholder text) in a matched-geometry comparison (norm ~1.55, cosine ~0.64) where coherence loss is direction-pattern-dependent rather than magnitude-dependent. A pipeline also identifies a top causally responsible feature in Llama-3.1-8B-Instruct.

Significance. If the results hold, this work is significant for mechanistic interpretability because it provides concrete evidence that standard top-context labeling plus single-feature steering can miss causal structure, and it demonstrates the utility of joint analysis via explicit controls (random directions, single vs. joint, coefficient sweeps) and cross-model replication with CI separation by ~10x on the matched-geometry control. The direction-pattern dependence of coherence loss offers a falsifiable distinction that could improve steering experiment design.

major comments (2)
  1. [matched-geometry comparison] The central claim that distinct output regimes demonstrate mislabeling of causal axes rests on the matched-geometry comparison of single-feature, joint, and random-direction perturbations. This control matches only first-order geometry (norm ~1.55, cosine ~0.64) but does not report higher-order statistics of the joint vector or test survival of the effect under a different SAE dictionary on the same model, so the observed differences could arise from SAE-model interactions or residual correlations rather than mislabeled axes.
  2. [Abstract and replication statements] The abstract states that replication across Qwen3-1.7B-Instruct and Gemma-2-2B-it plus a pipeline result on Llama-3.1-8B-Instruct shows CI separation on the matched-geometry control, yet no methods, data, error bars, or exclusion criteria are visible. This absence is load-bearing for evaluating whether the three findings (inverted U-shape, joint-damage effect, and regime separation) are robust.
minor comments (1)
  1. [Abstract] The exact construction of the 'pairwise matrices' (how coefficients are co-varied with joint conditions) is referenced but not formalized in the abstract, which may hinder immediate reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas for strengthening the evidence on the limitations of single-feature SAE inspection. We address each major comment below and outline planned revisions.

Point-by-point responses
  1. Referee: [matched-geometry comparison] The central claim that distinct output regimes demonstrate mislabeling of causal axes rests on the matched-geometry comparison of single-feature, joint, and random-direction perturbations. This control matches only first-order geometry (norm ~1.55, cosine ~0.64) but does not report higher-order statistics of the joint vector or test survival of the effect under a different SAE dictionary on the same model, so the observed differences could arise from SAE-model interactions or residual correlations rather than mislabeled axes.

    Authors: We agree that matching only first-order geometry leaves room for alternative explanations such as residual correlations or SAE-specific interactions. However, the observed regimes are qualitatively distinct (strategy-filler substitution for single-feature, diverse content for random, placeholder text for joint) and the coherence loss is explicitly pattern-dependent rather than magnitude-dependent, which is harder to attribute to first-order residuals alone. The ~10x CI separation on the matched-geometry control and replication on Gemma-2-2B-it (different model and SAE) provide additional support. We will add higher-order statistics of the joint vector (e.g., kurtosis, pairwise feature correlations within the cluster) and a limitations paragraph discussing the absence of a same-model different-SAE test. We did not perform the latter because our design prioritized cross-model replication. revision: partial

  2. Referee: [Abstract and replication statements] The abstract states that replication across Qwen3-1.7B-Instruct and Gemma-2-2B-it plus a pipeline result on Llama-3.1-8B-Instruct shows CI separation on the matched-geometry control, yet no methods, data, error bars, or exclusion criteria are visible. This absence is load-bearing for evaluating whether the three findings (inverted U-shape, joint-damage effect, and regime separation) are robust.

    Authors: We accept that the current manuscript does not make the replication methods, raw data summaries, error bars, and exclusion criteria sufficiently explicit. The abstract condenses results that appear in the main text and figures, but we will expand the Methods section with a dedicated replication subsection, include tables reporting per-model CI values and sample sizes, specify exclusion criteria for prompt sets, and add error bars to all relevant plots. The Llama pipeline will also receive expanded description. These changes will be incorporated in the revision. revision: yes
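The higher-order statistics promised in the first response (kurtosis of the joint vector, pairwise feature correlations within the cluster) are straightforward to compute. A plain sketch, with function names that are mine rather than the authors':

```python
import math

def excess_kurtosis(xs):
    """Excess kurtosis of a perturbation vector's components, one of the
    higher-order statistics the rebuttal proposes to report (illustrative)."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0

def pairwise_cosines(dirs):
    """Cosine similarity for every pair of feature directions in the cluster."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return {(i, j): cos(dirs[i], dirs[j])
            for i in range(len(dirs)) for j in range(i + 1, len(dirs))}
```

Reporting these for the joint vector and the random control would show whether the regimes diverge on statistics the first-order match does not constrain.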

Circularity Check

0 steps flagged

No significant circularity: empirical protocol with independent controls

Full rationale

The paper reports direct experimental measurements of steering effects under single-feature, joint, and random-direction perturbations on Qwen3-1.7B-Instruct (replicated on Gemma-2-2B-it and Llama-3.1-8B-Instruct). All load-bearing claims rest on observed output regime differences at matched norm ~1.55 and cosine ~0.64, with CI separation and coefficient sweeps. No equations, fitted parameters, or self-citations are invoked to derive the reported effects; the pairwise matrix protocol is presented as a new measurement procedure whose results are compared against explicit controls rather than reduced to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Implicit domain assumptions include that activation steering produces interpretable causal effects and that top-activating contexts provide valid feature labels.

axioms (1)
  • domain assumption Activation steering with scalar coefficients on SAE features produces measurable and interpretable changes in model output.
    Foundational to all reported steering experiments and coefficient sweeps.
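This axiom can be made concrete with a minimal sketch of scalar-coefficient steering, assuming the intervention simply adds c times a decoder direction to a residual-stream activation. Hook placement and any normalization are not specified in the abstract, so the names and details here are illustrative.

```python
def steer(activation, direction, c):
    """Add c times one SAE decoder direction to a residual-stream activation.
    Minimal sketch; the actual hook site and normalization are assumptions."""
    return [h + c * d for h, d in zip(activation, direction)]

def joint_steer(activation, directions, c):
    """Joint condition: the same scalar coefficient applied to several
    feature directions at once, as in the cluster-suppression experiments."""
    out = list(activation)
    for d in directions:
        out = steer(out, d, c)
    return out
```

Under this reading, joint suppression at c=-500 is `joint_steer(h, cluster_dirs, -500.0)`, while the single-feature control at the same magnitude is `steer(h, cluster_dirs[0], -500.0)`.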

pith-pipeline@v0.9.0 · 5579 in / 1393 out tokens · 41644 ms · 2026-05-08T18:46:01.705512+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

    Table 7: Rate on introspective prompts, Qwen3-1.7B-Base (raw prompts) versus Qwen3-1.7B-Instruct (chat-formatted prompts). The philosophy-of-mind lemmas are below the ∼9% noise floor in the base; after post-training they reach 25–34%. The broader meta-cognitive lemmas are present at 15–16%in the base, consistent with post-training amplifying rather than i...