Interpretability Transfer from Language to Vision via Sparse Autoencoders

Alexey Kravets; Chuan Li; Da Chen; Da Li; Vinay P. Namboodiri

arxiv: 2605.24946 · v1 · pith:36R35NS7new · submitted 2026-05-24 · 💻 cs.CV

Pith reviewed 2026-06-30 11:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords sparse autoencoders · interpretability · vision-language models · concept alignment · multimodal models · feature intervention · object editing

The pith

A visual projector can be regularized to map image tokens into an LLM's existing textual sparse autoencoder space, transferring labeled concepts for interpretation and editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VISTA, a method that regularizes the visual projector in a vision-language model to place image tokens inside the space of a pre-trained textual sparse autoencoder. This transfers human-labeled concepts from language without the need to train separate vision autoencoders. The alignment produces a threefold increase in the rate at which the most active textual concepts match semantic parts of the input image. It also supports precise edits where specific objects are removed or replaced in the model's output, with gains of 35 percent (removal) and 47 percent (replacement) over baselines that do not use the shared space. The framework further shows that DINOv2 vision features support better spatial localization for these edits than other encoders.

Core claim

VISTA constrains the visual projector with the LLM's SAE reconstruction loss so that visual tokens inhabit the text SAE manifold. This yields a threefold increase in matching rate between activating textual concepts and image semantics. Localized interventions then remove objects 35 percent more effectively and replace them 47 percent more effectively than vision-only baselines, with the effect holding across multiple LLM architectures.

What carries the argument

The visual projector, regularized by the textual SAE reconstruction loss so that image tokens align with labeled language concepts.
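
The regularizer named above can be illustrated with a minimal sketch. It assumes a plain ReLU sparse autoencoder with hypothetical toy weights; the SAE variant, layer choice, and loss scaling used in the paper are not stated on this page. The point is only that an activation the frozen SAE can reconstruct incurs no penalty, while one it cannot reconstruct is penalized, which is the signal that would push projected image tokens toward the text SAE's space.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(h: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU encoder: one non-negative activation per SAE feature."""
    pre = matvec(w_enc, h)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(z: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Linear decoder: w_dec has one row per model dimension."""
    return [r + b for r, b in zip(matvec(w_dec, z), b_dec)]


def sae_reconstruction_loss(h, w_enc, b_enc, w_dec, b_dec) -> float:
    """Mean squared error between a token activation and its SAE reconstruction."""
    h_hat = sae_decode(sae_encode(h, w_enc, b_enc), w_dec, b_dec)
    return sum((a - b) ** 2 for a, b in zip(h, h_hat)) / len(h)


# Toy frozen SAE (hypothetical weights): 2-dimensional activations, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B_ENC = [0.0, 0.0, -10.0]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B_DEC = [0.0, 0.0]

# An activation the SAE can represent, and one it cannot (negative component).
on_manifold = sae_reconstruction_loss([2.0, 3.0], W_ENC, B_ENC, W_DEC, B_DEC)
off_manifold = sae_reconstruction_loss([-2.0, 3.0], W_ENC, B_ENC, W_DEC, B_DEC)
```

In a training loop this quantity would be averaged over image-token positions and added to the cross-entropy loss with some weight; only the projector would receive gradients, since Figure 1 describes the vision encoder and LLM as frozen.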

If this is right

Textual SAEs provide interpretability for vision without dedicated vision training.
DINOv2 encoders yield stronger localization than alternatives in the aligned space.
Concept-level edits can target specific objects while preserving the rest of the scene.
The method applies to multiple LLM backbones using the same textual SAE.
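
The localization point can be made concrete with a small sketch. The score below is a hypothetical measure, not the paper's: it takes the patches where a concept's SAE feature fires most strongly and reports the fraction that fall inside the object's bounding box. The grid size, box convention, and choice of k are all assumptions for illustration.

```python
from typing import List, Tuple

# (row_min, col_min, row_max, col_max), inclusive, in patch-grid units
Box = Tuple[int, int, int, int]


def top_k_patches(activations: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Return (row, col) of the k patches with the highest feature activation."""
    cells = [(a, r, c) for r, row in enumerate(activations) for c, a in enumerate(row)]
    cells.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in cells[:k]]


def localization_score(activations: List[List[float]], box: Box, k: int) -> float:
    """Fraction of the top-k patches that lie inside the object's box."""
    r0, c0, r1, c1 = box
    hits = sum(
        1 for r, c in top_k_patches(activations, k)
        if r0 <= r <= r1 and c0 <= c <= c1
    )
    return hits / k


# Toy 3x3 patch grid; the object occupies the top-left 2x2 block.
cat_box = (0, 0, 1, 1)
focused = [[0.9, 0.8, 0.0], [0.7, 0.6, 0.0], [0.0, 0.0, 0.1]]
diffuse = [[0.9, 0.0, 0.8], [0.0, 0.1, 0.0], [0.7, 0.0, 0.6]]

focused_score = localization_score(focused, cat_box, k=4)
diffuse_score = localization_score(diffuse, cat_box, k=4)
```

A map like `focused` corresponds to the behavior the Figure 2 caption attributes to DINOv2, and `diffuse` to the spatial confusion it attributes to CLIP.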

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If visual tokens reliably occupy the text SAE manifold, the same alignment could be tested on other input types such as audio.
Scaling the approach might reveal limits on how well the manifold captures fine visual details.
Interventions could be combined with other editing techniques to achieve more complex scene changes.

Load-bearing premise

The textual SAE features correspond to distinct semantic visual elements after the projector is regularized, so that altering those features changes only the intended part of the visual output.

What would settle it

A case where deactivating an SAE feature matched to a particular object in the image fails to remove that object from the generated response would falsify the claim that visual tokens inhabit the text SAE manifold.
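
The intervention half of that test can be sketched as follows. This is a generic feature-ablation routine under stated assumptions (a plain ReLU SAE without biases, toy identity weights, hypothetical feature indices); the paper's actual steering procedure is not described on this page. The SAE's reconstruction error is added back so that only the targeted feature's contribution is removed from the token.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def encode(h: Vector, w_enc: Matrix) -> Vector:
    """ReLU encoder without biases (simplification)."""
    return [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_enc]


def decode(z: Vector, w_dec: Matrix) -> Vector:
    """Linear decoder; w_dec has one row per model dimension."""
    return [sum(w * a for w, a in zip(row, z)) for row in w_dec]


def ablate_feature(h: Vector, w_enc: Matrix, w_dec: Matrix, feature: int) -> Vector:
    """Zero one SAE feature in a token activation, keeping everything else."""
    z = encode(h, w_enc)
    # Part of h the SAE fails to explain; preserved so the edit stays targeted.
    error = [x - y for x, y in zip(h, decode(z, w_dec))]
    z[feature] = 0.0
    return [y + e for y, e in zip(decode(z, w_dec), error)]


# Toy SAE: feature 0 stands in for "cat", feature 1 for "couch" (hypothetical labels).
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

edited = ablate_feature([2.0, 3.0], IDENTITY, IDENTITY, feature=0)
edited_with_residual = ablate_feature([2.0, -1.0], IDENTITY, IDENTITY, feature=0)
```

To run the falsification test itself, the edited activations would replace the originals at the image patches covering the object, and the model's generated response would be checked for whether the object is still mentioned.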

Figures

Figures reproduced from arXiv: 2605.24946 by Alexey Kravets, Chuan Li, Da Chen, Da Li, Vinay P. Namboodiri.

**Figure 1.** VISTA Framework. Training: A trainable projector maps tokens from a frozen vision encoder to a frozen LLM, optimized via Cross-Entropy and an auxiliary SAE Reconstruction Loss to force alignment of the visual embeddings with the manifold defined by pre-trained Text SAEs. Visual Interpretability: Once aligned, highly activating visual tokens can be interpreted via the Text SAE. Activating latents are search…

**Figure 2.** Spatial Location Analysis of activating tokens. Top Row: DINOv2 precisely localizes visual concepts (cat and cookie). Bottom Row: CLIP activations exhibit spatial confusion, often activating tokens far from the relevant object. This high spatial fidelity in DINOv2 is a prerequisite for localized concept steering.

**Figure 3.** DINOv2 Match Rate for visual tokens with and without SAE constraints with Gemma-2-2B-it LLM model.

**Figure 5.** Visual steering with DINOv2, selecting precise patches and changing model’s understanding of what is contained in them.

**Figure 6.** Qualitative results. We steer (a) Sadness, (b) Sleeping, and (c) Danger. In all cases, the steering vector inverts the model’s interpretation (Steered) vs the original (Baseline).

**Figure 8.** Matching rate for DINOv2 visual encoder.

**Figure 9.** Matching rate for CLIP visual encoder.

**Figure 10.** Matching rate for I-JEPA visual encoder.

**Figure 11.** CLIP with Gemma-2-2B-it reconstruction and sparsity.

**Figure 12.** I-JEPA with Gemma-2-2B-it reconstruction and sparsity. [Plot panels: "SAE Reconstruction Loss Across Layers" (x-axis: Layer; y-axis: Reconstruction Loss (MSE)) and "SAE Active Feature Fraction Across Layers" (x-axis: Layer; y-axis: L0 Density); series by Token Type: VLM Image Positions, VLM Text Positions.]

**Figure 13.** DINOv2 with LLaMA-3.1-8B-Instruct reconstruction and sparsity.

**Figure 14.** CLIP with LLaMA-3.1-8B-Instruct reconstruction and sparsity. [Plot panels: "SAE Reconstruction Loss Across Layers" (x-axis: Layer; y-axis: Reconstruction Loss (MSE)) and "SAE Active Feature Fraction Across Layers" (x-axis: Layer; y-axis: L0 Density); series by Token Type: VLM Image Positions, VLM Text Positions.]

**Figure 15.** I-JEPA with LLaMA-3.1-8B-Instruct reconstruction and sparsity.

**Figure 16.** DINOv2 with Gemma-2-9B-it reconstruction and sparsity. [Plot panels: "SAE Reconstruction Loss Across Layers" (x-axis: Layer; y-axis: Reconstruction Loss (MSE)) and "SAE Active Feature Fraction Across Layers" (x-axis: Layer; y-axis: L0 Density); series by Token Type: VLM Image Positions, VLM Text Positions.]

**Figure 17.** CLIP with Gemma-2-9B-it reconstruction and sparsity.

**Figure 18.** JEPA with Gemma-2-9B-it reconstruction and sparsity.

**Figure 20.** Qualitative example of visual steering with CLIP visual encoder and Gemma-2-2B-it LLM, selecting precise patches in the image and changing model’s understanding of what is contained in them. Local steering with CLIP visual encoder is not effective. Q: What is shown in this image? Original: The image shows a cat sitting on a couch with a pillow. The cat is looking at the camera. Remove “cat”: The image sho…

**Figure 21.** Qualitative example of visual steering with DINOv2 visual encoder and Gemma-2-2B-it LLM, selecting precise patches in the image and changing model’s understanding of what is contained in them. Local steering with DINOv2 visual encoder is effective.

**Figure 22.** Performance-Interpretability Trade-off. Increasing the number of SAE-constrained layers improves interpretability up to a threshold. The closest performance to not including any SAE constraints is achieved including the SAEs from layer 0 to 4. [Plot: "SAE Feature Key Concept Match Rate Across Layers (Averaged across samples)"; x-axis: Layer; y-axis: Match Rate; series include No SAE, SAE (L0), …]

**Figure 23.** Matching Rate across layers for different SAE constraints. Increasing the number of SAE-constrained layers improves interpretability up to a threshold. Including all layers for SAE constraints reduces the matching rate in the last layers.

Original abstract

Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.
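
Since the abstract describes the matching rate only in words, here is one plausible reading of it as a sketch. Everything concrete is an assumption: the top-k cutoff, the use of per-image "key concepts" (a term taken from the Figure 22 plot title), and the substring test against feature labels. The paper's actual definition may differ.

```python
from typing import Dict, List, Set, Tuple


def top_k_features(activations: Dict[int, float], k: int) -> List[int]:
    """Indices of the k most strongly activating SAE features."""
    return sorted(activations, key=activations.get, reverse=True)[:k]


def is_match(activations: Dict[int, float], labels: Dict[int, str],
             key_concepts: Set[str], k: int) -> bool:
    """True if any top-k feature label mentions one of the image's key concepts."""
    for idx in top_k_features(activations, k):
        label = labels.get(idx, "").lower()
        if any(concept.lower() in label for concept in key_concepts):
            return True
    return False


def match_rate(samples: List[Tuple[Dict[int, float], Set[str]]],
               labels: Dict[int, str], k: int = 2) -> float:
    """Fraction of samples with at least one matching top-k feature."""
    hits = sum(is_match(acts, labels, concepts, k) for acts, concepts in samples)
    return hits / len(samples)


# Hypothetical feature labels and two toy samples (activations, key concepts).
LABELS = {0: "cats and kittens", 1: "punctuation",
          2: "baked goods such as cookies", 3: "legal terms"}
SAMPLES = [
    ({0: 5.0, 1: 1.0, 3: 0.5}, {"cat"}),     # top-2 features: 0, 1 (match)
    ({1: 4.0, 3: 3.0, 2: 0.2}, {"cookie"}),  # top-2 features: 1, 3 (no match)
]

rate = match_rate(SAMPLES, LABELS, k=2)
```

Substring matching is deliberately crude here ("cat" would also match "category"); a real evaluation would need a more careful comparison between labels and image content.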

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VISTA's intervention gains probably come from training the projector with the SAE loss rather than showing that visual tokens sit on the pre-trained text SAE manifold.

The main point is that adding the text SAE reconstruction loss to the visual projector training produces better object removal and replacement results than plain vision baselines. That is the concrete thing the paper shows.

What works is the basic transfer setup: they constrain the projector so visual tokens reconstruct well under an existing labeled text SAE, then use those features for interventions. This avoids training a separate vision SAE and gives a practical way to reuse language concepts. The encoder comparison is also useful; DINOv2 coming out ahead on localization is a clear side finding that stands on its own.

The soft spot is the baseline design. The vision-only controls lack the SAE term entirely, so the 35% and 47% gains could simply reflect the extra training signal making features more editable, without the tokens actually landing on the specific labeled manifold. The threefold matching-rate jump is consistent with improved alignment but does not rule out the projector learning some other mapping whose top SAE activations happen to line up with image content. No control with a matched-strength auxiliary loss is described, which leaves the causal claim under-supported.

The paper is aimed at people already working on SAE interpretability in multimodal models who want a lightweight way to get concept-level edits. A reader focused on practical reuse of language tools would find the method description and the DINOv2 result worth looking at.

It deserves peer review. The idea is straightforward, the experiments try to demonstrate utility, and the citation pattern looks reasonable. A referee would mainly need to press on whether the manifold claim can be isolated from the training difference.

Referee Report

2 major / 2 minor

Summary. The paper introduces VISTA, a framework that transfers interpretability from pre-trained textual sparse autoencoders (SAEs) to vision in LLaVA-style vision-language models. It does so by regularizing the visual projector with the LLM's SAE reconstruction loss to map visual tokens into the labeled textual SAE space. The work claims a threefold increase in matching rate between top-activating textual SAE concepts and image semantics, stronger spatial localization for DINOv2 encoders, and 35% / 47% gains in object removal and replacement interventions over vision-only baselines, which is presented as causal evidence that visual tokens inhabit the text SAE manifold. Results are reported across multiple LLM architectures.

Significance. If the alignment and causal claims hold after appropriate controls, the approach would provide a practical route to visual interpretability that reuses existing language SAEs rather than training dedicated vision SAEs, while enabling localized concept interventions in multimodal models.

major comments (2)

[Intervention experiments] Intervention experiments (object removal/replacement tasks): the vision-only baselines omit the SAE reconstruction loss used to train the VISTA projector. Consequently the reported 35% and 47% gains cannot be attributed specifically to visual tokens residing on the pre-trained textual SAE manifold rather than to the presence of the additional regularization term; this directly undermines the causal-evidence claim.
[Matching-rate evaluation] Matching-rate evaluation: the threefold increase is reported without an explicit definition of the metric, without ablations that isolate manifold habitation from general projector quality, and without controls that apply equivalent regularization without the SAE manifold constraint.

minor comments (2)

[Abstract] The abstract states that results are validated across multiple LLM architectures but supplies no per-architecture quantitative breakdowns or tables.
[Method] No equations or pseudocode are shown for the projector regularization term or the matching-rate computation, making it difficult to verify that the method is parameter-free with respect to the SAE labels.
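
For reference, a regularized objective of the kind this comment asks to see could be written as below. It is a sketch consistent with the Figure 1 caption (cross-entropy plus an auxiliary SAE reconstruction loss), not the paper's own equation. All notation is assumed: θ for the projector parameters, λ for the regularization weight, S for the set of SAE-constrained layers, V for the image-token positions, h_i^(ℓ) for the activation of token i at layer ℓ, and E_ℓ, D_ℓ for the encoder and decoder of the pre-trained text SAE at that layer.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{CE}}(\theta)
  + \lambda \sum_{\ell \in S} \frac{1}{|V|} \sum_{i \in V}
    \left\lVert h_i^{(\ell)}
      - D_\ell\!\left(E_\ell\!\left(h_i^{(\ell)}\right)\right)
    \right\rVert_2^2
```

Under this form the SAE labels never enter the loss; they are only consulted afterward, when active features are interpreted.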

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, acknowledging where additional controls are needed to support the causal claims, and outline the revisions we will implement.

Referee: [Intervention experiments] Intervention experiments (object removal/replacement tasks): the vision-only baselines omit the SAE reconstruction loss used to train the VISTA projector. Consequently the reported 35% and 47% gains cannot be attributed specifically to visual tokens residing on the pre-trained textual SAE manifold rather than to the presence of the additional regularization term; this directly undermines the causal-evidence claim.

Authors: We agree that the vision-only baselines lack the SAE reconstruction loss, which prevents cleanly attributing the reported gains to manifold alignment rather than the regularization itself. To address this, we will add new control experiments that apply equivalent regularization to the projector using non-SAE objectives (e.g., direct MSE on visual features or a randomly initialized autoencoder). These results, along with updated discussion of the causal evidence, will be included in the revised manuscript.
Revision: yes

Referee: [Matching-rate evaluation] Matching-rate evaluation: the threefold increase is reported without an explicit definition of the metric, without ablations that isolate manifold habitation from general projector quality, and without controls that apply equivalent regularization without the SAE manifold constraint.

Authors: We acknowledge that an explicit mathematical definition of the matching rate was not provided in the main text (though described qualitatively in the abstract). We will insert a precise definition and formula in the methods section. We will also add the requested ablations, including projectors trained with matched regularization strength but without the textual SAE constraint, to isolate the manifold effect from general projector improvements. These changes will be incorporated in the revision.
Revision: yes
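
The controls promised in these responses could be set up along the following lines. This is a hypothetical sketch: it builds a randomly initialized autoencoder with the same shapes as the text SAE, and it interprets "matched regularization strength" as rescaling the control's loss weight so both penalties start at the same magnitude. Neither choice is taken from the paper or the rebuttal.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.1) -> Matrix:
    """Gaussian-initialized weights that carry no learned or labeled concepts."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def make_control_autoencoder(d_model: int, n_features: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Random encoder and decoder with the same shapes as the text SAE."""
    rng = random.Random(seed)
    w_enc = random_matrix(n_features, d_model, rng)
    w_dec = random_matrix(d_model, n_features, rng)
    return w_enc, w_dec


def matched_weight(lambda_sae: float, sae_loss: float, control_loss: float) -> float:
    """Scale the control's loss weight so both penalties start at equal magnitude."""
    if control_loss <= 0.0:
        raise ValueError("control loss must be positive")
    return lambda_sae * sae_loss / control_loss


# Toy sizes and loss values, chosen only for illustration.
w_enc, w_dec = make_control_autoencoder(d_model=4, n_features=8)
lambda_control = matched_weight(lambda_sae=0.5, sae_loss=2.0, control_loss=4.0)
```

If a projector trained against such a control matched VISTA's removal and replacement gains, the gains would be attributable to regularization in general rather than to the labeled text SAE space.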

Circularity Check

0 steps flagged

No circularity: empirical framework with externally defined metrics and no self-referential derivations

The manuscript presents VISTA as an empirical method that regularizes a visual projector via an existing textual SAE reconstruction loss and reports comparative performance gains on matching rate, object removal, and replacement tasks. No equations, derivations, or parameter-fitting steps are described that would reduce the central claim (visual tokens inhabiting the SAE manifold) to a tautology or fitted input renamed as prediction. The matching rate and intervention metrics are defined externally rather than by construction from the regularization term. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way. This is the common honest case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described or can be extracted.
