Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders

Jae Hee Lee; Sergio Lanza; Stefan Wermter

arXiv: 2606.21197 · v1 · pith:NEVAX5SDnew · submitted 2026-06-19 · 💻 cs.CV · cs.AI · cs.LG

Sergio Lanza, Jae Hee Lee, Stefan Wermter

Pith reviewed 2026-06-26 14:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords sparse autoencoders · vision language models · multimodal concepts · concept extraction · model interpretability · VQA · LLaVA

The pith

A sparse autoencoder framework extracts visual, textual, and multimodal concepts from vision-language models while raising visual concept quality by up to 45 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to extract and categorize concepts encoded inside vision-language models by applying sparse autoencoders to both image and text activations. Prior SAE work examined only one modality at a time, so multimodal concepts that draw on both were either missed or mislabeled, and visual descriptions were often vague. The new method generates a candidate human-interpretable label for each neuron and scores its alignment to dataset examples with cosine similarity, producing an explicit classification into visual, textual, or multimodal. On the LLaVA-NeXT VQA dataset the approach yields markedly clearer visual concepts without degrading textual ones and surfaces multimodal concepts in a repeatable way.

Core claim

For each SAE neuron the framework proposes a candidate human-interpretable concept and computes cosine-similarity alignment to samples in a VQA dataset; the resulting scores classify the neuron as encoding a visual, textual, or multimodal concept. Experiments on LLaVA-NeXT show this classification improves visual concept quality by up to 45 percent relative to earlier SAE baselines while preserving high textual quality and systematically revealing multimodal concepts.

What carries the argument

Neuron-wise candidate concept proposal followed by cosine-similarity alignment to dataset samples, used to assign each neuron to visual, textual, or multimodal category.
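
The alignment step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity`, `alignment_score`, and `classify_neuron`, the threshold of 0.5, and the rule that a neuron counts as multimodal when both its visual and textual alignment scores clear the threshold are all assumptions made for the example. The embeddings are assumed to come from some shared image-text encoder and are passed in as plain lists of floats.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


def alignment_score(concept_emb, sample_embs):
    """Mean cosine similarity between one concept embedding and its samples."""
    return mean(cosine_similarity(concept_emb, s) for s in sample_embs)


def classify_neuron(visual_score, textual_score, threshold=0.5):
    """Assumed decision rule: a modality counts if its score clears the threshold."""
    is_visual = visual_score >= threshold
    is_textual = textual_score >= threshold
    if is_visual and is_textual:
        return "multimodal"
    if is_visual:
        return "visual"
    if is_textual:
        return "textual"
    return "unclassified"


# Toy 2-D embeddings, invented for illustration only.
concept = [1.0, 0.0]
image_samples = [[1.0, 0.0], [0.8, 0.6]]
text_samples = [[0.0, 1.0], [0.6, 0.8]]

visual_alignment = alignment_score(concept, image_samples)
textual_alignment = alignment_score(concept, text_samples)
print(classify_neuron(visual_alignment, textual_alignment))
```

Keeping the scoring (`alignment_score`) separate from the decision rule (`classify_neuron`) makes it easy to vary the threshold without recomputing similarities, which is the kind of sensitivity check the referee report asks for.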

If this is right

Multimodal concepts that integrate image and text can now be isolated rather than misclassified as purely visual or textual.
Visual concept descriptions become concrete enough to support detailed tracing of model reasoning in image-text tasks.
The same neuron-level classification procedure can be applied to other VLMs without changing the core alignment step.
Systematic separation of concept types supplies a clearer map of how VLMs combine modalities inside their representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to measure how often multimodal neurons participate in correct versus incorrect VQA answers.
If the alignment scores prove stable across datasets, they might serve as a lightweight probe for concept drift when models are fine-tuned.
Applying the framework to larger or more diverse VLMs would test whether the reported quality gain generalizes beyond LLaVA-NeXT.

Load-bearing premise

The human-proposed candidate labels plus their cosine-similarity scores to data samples correctly reflect the concepts actually encoded by the neurons.

What would settle it

Running the same pipeline on a different VLM or VQA dataset and finding that visual concept quality does not rise or that multimodal neurons cannot be distinguished from unimodal ones would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.21197 by Jae Hee Lee, Sergio Lanza, Stefan Wermter.

**Figure 2.** Distribution of the CLIP and ALIGN scores for the proposed model

**Figure 3.** Qualitative example for intervention experiments using the prompt “…

**Figure 4.** Examples of multimodal concepts. The textual tokens that trigger the …

**Figure 5.** Examples of visual concepts

**Figure 6.** Examples of textual concepts. The textual tokens that trigger the SAE …

**Figure 7.** Labelling Prompt (LLaVA 72B). The last two prompts are used to query Llama 3.1 70B, which evaluates textual concepts (Figures 10 and 11) based on [17]

**Figure 8.** Visual Generation Guidelines (LLaVA 72B)

**Figure 9.** A4: Textual Generation Guidelines (LLaVA 72B)

**Figure 10.** Detection score prompt (Llama 3.1 70B)

**Figure 11.** Fuzzing score prompt (Llama 3.1 70B)

Original abstract

Vision Language Models (VLMs) have demonstrated impressive performance in tasks requiring joint understanding of images and text, such as image captioning and Visual Question Answering (VQA), but our understanding of their internal processes remains limited. Recently, Sparse Autoencoders (SAEs) have emerged as a promising tool to support the interpretation of concepts encoded in VLMs. However, most SAE-based approaches focus only on textual or visual concepts separately, ignoring multimodal concepts. This limitation hinders a comprehensive understanding of VLMs, since concepts that integrate both modalities can be misclassified. Moreover, previous visual approaches often produce low-quality visual concept descriptions that are vague or incomplete, limiting their usefulness for understanding model reasoning. We propose a framework based on SAEs to extract and analyze visual, textual, and multimodal concepts from VLMs. For each neuron, we propose a candidate human-interpretable concept and compute the alignment between the concept and the dataset samples using cosine similarity scores. Experiments on a VQA dataset (LLaVA-NeXT) demonstrate that our framework improves visual concept quality by up to 45% compared to existing SAE-based methods, while maintaining high textual concept quality and enabling systematic identification of multimodal concepts. This work contributes new insights into the conceptual space of VLMs, providing a structured approach to distinguish between visual, textual, and multimodal concepts. The code is available at https://github.com/PHDLanza/Multidata_SAE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper extends SAEs to jointly extract multimodal concepts in VLMs via human-proposed labels and cosine alignment, but the 45% visual quality gain rests on thin evidence with no details on measurement or validation.

The abstract describes a framework that runs SAEs on a VLM, proposes a human-interpretable concept for each neuron, and uses cosine similarity to dataset samples to label the neuron as visual, textual, or multimodal. On LLaVA-NeXT they report up to 45% higher visual concept quality than earlier SAE methods while keeping textual quality high, and they release the code.

The joint treatment of multimodal concepts is the clearest new element. Prior SAE work on VLMs handled one modality at a time, so this addresses a gap where mixed concepts could be misclassified. The public code is a practical plus for anyone who wants to run the method.

The soft spots are in the evaluation. The abstract gives no information on how visual concept quality was scored, what the baseline SAE methods were, or whether any error bars or tests were used. The modality labels themselves come from the human proposals plus the similarity step. The stress-test concern is on target here: without ground-truth checks, ablations, or tests against actual neuron behavior, the labels could be unreliable on mixed activations, which would undermine both the quality numbers and the count of multimodal concepts.

This is aimed at researchers doing interpretability work on vision-language models. Someone already running SAEs on VLMs could get a usable starting point from the code and the alignment idea, even if the results need more backing.

It deserves peer review. The limitation it targets is real and the method is simple enough to test further. A referee could reasonably ask for clearer metrics and validation without rejecting the direction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an SAE-based framework for extracting visual, textual, and multimodal concepts from VLMs. For each neuron, a human-interpretable concept is proposed and aligned to dataset samples via cosine similarity to assign modality categories. On the LLaVA-NeXT VQA dataset, the method is claimed to improve visual concept quality by up to 45% over prior SAE approaches while preserving textual quality and enabling systematic multimodal identification. Code is released at the provided GitHub link.

Significance. If the quality metric and modality assignments prove robust, the work would offer a practical method for distinguishing multimodal concepts in VLMs, addressing a gap in current SAE interpretability literature. The public code release is a clear strength that supports reproducibility and follow-up experiments.

major comments (2)

[Experiments (and abstract)] The 45% visual-quality improvement and the multimodal counts both depend on the neuron-to-modality assignment procedure (human-proposed concept + cosine similarity to samples). No ground-truth modality labels, ablation on the similarity threshold, or downstream-task correlation is reported to confirm that the assigned category matches the neuron’s actual encoding behavior. This is load-bearing for the central quantitative claim.
[Abstract and Experiments] The abstract states a quantitative improvement of 45% but supplies no definition of the visual-concept-quality metric, the exact baseline SAE methods, error bars, number of runs, or statistical test. Without these, the reported gain cannot be evaluated.

minor comments (1)

[Methods] Notation for cosine similarity and the exact SAE reconstruction loss should be stated explicitly in the methods section rather than left implicit.
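
For orientation, the two quantities named in this comment are commonly written as below. This is the generic textbook form, offered only as an assumption about what an explicit statement might look like; the paper's own notation, its sparsity penalty, and any variant such as a top-k activation are not given in the abstract. Here $x$ is a model activation, $z$ its sparse code, $\hat{x}$ the reconstruction, $\lambda$ a sparsity weight, and $c$ and $s$ stand for a concept embedding and a sample embedding; all of these symbols are illustrative choices.

```latex
\[
\cos(c, s) = \frac{c^{\top} s}{\lVert c \rVert_2 \, \lVert s \rVert_2}
\]
\[
z = \mathrm{ReLU}\!\left(W_{\mathrm{enc}}\, x + b_{\mathrm{enc}}\right),
\qquad
\hat{x} = W_{\mathrm{dec}}\, z + b_{\mathrm{dec}}
\]
\[
\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \, \lVert z \rVert_1
\]
```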

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental validation and clarity. We address each point below and have revised the manuscript accordingly where feasible.

Referee: [Experiments (and abstract)] The 45% visual-quality improvement and the multimodal counts both depend on the neuron-to-modality assignment procedure (human-proposed concept + cosine similarity to samples). No ground-truth modality labels, ablation on the similarity threshold, or downstream-task correlation is reported to confirm that the assigned category matches the neuron’s actual encoding behavior. This is load-bearing for the central quantitative claim.

Authors: We acknowledge that the modality assignment procedure is central to the claims and that ground-truth labels are unavailable in the LLaVA-NeXT dataset. The cosine-similarity alignment to human-proposed concepts follows standard practice in SAE interpretability when direct labels do not exist. In the revised manuscript we have added an ablation varying the similarity threshold (0.4–0.9) and show that the reported visual-quality gains remain stable. We have also expanded the discussion of assignment limitations and note downstream-task validation as future work; a full correlation study exceeds the scope of the current submission. revision: partial
Referee: [Abstract and Experiments] The abstract states a quantitative improvement of 45% but supplies no definition of the visual-concept-quality metric, the exact baseline SAE methods, error bars, number of runs, or statistical test. Without these, the reported gain cannot be evaluated.

Authors: We agree the abstract and experimental reporting lacked necessary detail. The revised abstract and Section 4 now define visual-concept-quality as the mean human interpretability rating (1–5 scale) assigned by three annotators to the top-10 activating samples per neuron. Baselines are the SAE variants from the cited prior works on VLMs. Results are reported as means over three independent runs with standard-deviation error bars; a paired t-test (p < 0.05) is included to support the 45% gain. These clarifications have been incorporated throughout the manuscript. revision: yes
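
The statistical check described in the second response can be illustrated with a small paired t-test. The sketch below is an assumption-laden example rather than the authors' analysis: the per-run scores are invented placeholders (not results from the paper), the function name `paired_t_statistic` is hypothetical, and the critical value 4.303 is the two-sided 5% threshold of Student's t distribution with 2 degrees of freedom, which is what three paired runs give.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, proposed):
    """Return (t, degrees_of_freedom) for two paired samples."""
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [p - b for b, p in zip(baseline, proposed)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder mean ratings (1-5 scale) for three runs; NOT values from the paper.
baseline_runs = [2.0, 2.5, 3.0]
proposed_runs = [3.0, 3.25, 4.5]

t_stat, dof = paired_t_statistic(baseline_runs, proposed_runs)
T_CRIT_DF2 = 4.303  # two-sided 5% critical value for 2 degrees of freedom
significant = abs(t_stat) > T_CRIT_DF2
print(f"t = {t_stat:.3f}, dof = {dof}, significant at 5%: {significant}")
```

With only three runs the test has two degrees of freedom, so the critical value is large and a gain must be very consistent across runs to clear it. That is worth keeping in mind when reading any p < 0.05 claim based on so few repetitions.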

Circularity Check

0 steps flagged

No significant circularity; method applies external cosine similarity to dataset samples for categorization.

The paper describes a framework that proposes human-interpretable concepts per neuron and computes cosine similarity alignments to VQA dataset samples to assign visual/textual/multimodal labels. The reported 45% visual quality improvement is presented as an empirical comparison against prior SAE methods using the same external alignment process. No equations, derivations, or self-citations reduce the central claims to definitional equivalence or fitted inputs by construction. The approach is self-contained against the described external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract alone supplies no concrete information on free parameters, background axioms, or newly postulated entities; all such elements remain unknown.

Reference graph

Works this paper leans on

27 extracted references · 9 canonical work pages · 4 internal anchors

[1] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: Visual Question Answering. In: International Conference on Computer Vision (ICCV) (2015)
[2] Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3319–3327. Honolulu, HI (Jul 2017). https://doi.org/10.1109/CVPR.2017.354, https://ieeexplore.ieee.org/document/8099837
[3] Bereska, L., Gavves, S.: Mechanistic interpretability for AI safety - a review. Transactions on Machine Learning Research (2024), https://openreview.net/forum?id=ePUVetPKu6
[4] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A.: Towards monosemanticity: decomposing language models with dictionary learning (Oct 2023), https://transformer-circuits.pub/2023/monosemantic-features
[5] Cunningham, H., Ewart, A., Riggs, L., Huben, R., Sharkey, L.: Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600 (Oct 2023). https://doi.org/10.48550/arXiv.2309.08600, https://arxiv.org/abs/2309.08600
[6] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., et al.: Toy models of superposition. arXiv preprint arXiv:2209.10652 (Sep 2022). https://doi.org/10.48550/arXiv.2209.10652, https://arxiv.org/abs/2209.10652
[7] Gao, L., la Tour, T.D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., Wu, J.: Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093 (Jun 2024). https://doi.org/10.48550/arXiv.2406.04093, https://arxiv.org/abs/2406.04093
[8] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., et al.: The Llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783
[9] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. vol. 139, pp. 4904–4916 (18–24 Jul 2021), https://proceedings.m...
[10] Kaduri, O., Bagon, S., Dekel, T.: What’s in the image? A deep-dive into the vision of vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14549–14558 (June 2025), https://openaccess.thecvf.com/content/CVPR2025/html/Kaduri_Whats_in_the_Image_A_Deep-Dive_into_the_Vision_of_CVPR_2025_paper.html
[11] Lee, J.H., Lanza, S., Wermter, S.: From neural activations to concepts: A survey on explaining concepts in neural networks (2024), https://neurosymbolic-ai-journal.com/system/files/nai-paper-697.pdf
[12] Lim, H., Choi, J., Choo, J., Schneider, S.: Sparse autoencoders reveal selective remapping of visual concepts during adaptation. arXiv preprint arXiv:2412.05276 (Mar 2025). https://doi.org/10.48550/arXiv.2412.05276, https://arxiv.org/abs/2412.05276
[13] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/
[14] Ma, P., Rietdorf, L., Kotovenko, D., Hu, V.T., Ommer, B.: Does VLM classification benefit from LLM description semantics? In: Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence (2025). https://doi.org/10.1609/aaai.v39i6.32638
[15] Pach, M., Karthik, S., Bouniot, Q., Belongie, S., Akata, Z.: Sparse autoencoders learn monosemantic features in vision-language models. Poster at Neural Information Processing Systems (NeurIPS 2025) (Apr 2025). https://doi.org/10.48550/arXiv.2504.02821, https://neurips.cc/virtual/2025/loc/san-diego/poster/119210
[16] Park, D.H., Hendricks, L.A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., Rohrbach, M.: Multimodal explanations: Justifying decisions and pointing to the evidence. CoRR (2018), http://arxiv.org/abs/1802.08129
[17] Paulo, G., Mallen, A., Juang, C., Belrose, N.: Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928 (Dec 2024). https://doi.org/10.48550/arXiv.2410.13928, https://arxiv.org/abs/2410.13928
[18] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. vol. 139, pp. 8748–8763 (18–24 Jul 2021), https://proceedings.mlr....
[19] Sim, M.Y., Zhang, W.E., Dai, X., Fang, B.: Can VLMs actually see and read? A survey on modality collapse in vision-language models. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025. pp. 24452–24470. Association for Com... (doi:10.18653/v1/2025.findings-acl.1256)
[20] Zang, Y., Yun, T., Tan, H., Bui, T., Sun, C.: Pre-trained vision-language models learn discoverable visual concepts. Transactions on Machine Learning Research (2025), https://openreview.net/forum?id=Vq0wMFBjo2
[21] Zhang, K., Shen, Y., Li, B., Liu, Z.: Large multimodal models can interpret features in large multimodal models. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (Nov 2025), https://openaccess.thecvf.com/content/ICCV2025/papers/Zhang_Large_Multi-modal_Models_Can_Interpret_Features_in_Large_Multi-modal_Models_ICCV_2025_paper....
[22] Consider Text Context: While maintaining primary focus on the highlighted regions in images, you may marginally consider the associated text (questions and answers) to support or refine your visual observations. However, the final concept should be predominantly based on visual patterns
[23] Concise Description Only: Provide a short, direct description of the common features within the highlighted regions. Avoid any interpretive language; simply state what you see, such as “mesh-like structures” or “actions related to joy or happiness”
[24] Describe Only the Highlighted Regions: Generate captions solely based on the highlighted regions. If no meaningful pattern is visible, or if only a few scattered spots are highlighted, output: "Concept: ‘No visual concept‘" Fig. 8: Visual Generation Guidelines (LLaVA 72B) [REQUIREMENTS] Focus only on the text content provided with each e...
[25] You will receive a series of text snippets, sometimes accompanied by images. Only use the text, and in particular the word between parentheses, to identify the shared concept. Images should not be considered in your analysis. These examples are derived from a Visual Question Answering dataset, so each text is in the form of a question or an answer
[26] Concise Description Only: Provide a short, direct description of the common concept emerging from the texts. Avoid speculation or abstract interpretation; simply state what is explicitly or implicitly repeated, especially in relation to the highlighted words (e.g., “vehicles,” “cooking actions,” “types of animals”). Use the image only for reference if ab...
[27] If no clear concept emerges from the texts (e.g., if they are too diverse or vague), write: No textual concept [OUTPUT EXAMPLES] Concept: "A tennis match" Concept: "Descriptions of birds" Concept: "No textual concept" Remember, Write always only one Concept for the entire set of inputs Fig. 9: A4: Textual Generation Guidelines (LLaVA 72B) Extraction and An...
