Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

Carsten Eickhoff; Danielle Bitterman; Michal Golovanevsky; Niclas Lietzow; William Rudman

arxiv: 2606.28273 · v1 · pith:C6FNNLUVnew · submitted 2026-06-26 · 💻 cs.CL


Pith reviewed 2026-06-29 03:49 UTC · model grok-4.3


keywords vision-language models · attention heads · causal mechanisms · perception-knowledge conflict · activation patching · ablation studies · mechanistic interpretability


The pith

Vision-language models default to visual evidence but use a small set of late attention heads to override with stored knowledge on conflict.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how vision-language models resolve cases where visual input clashes with memorized facts. It shows that visual grounding is the default behavior, while answers drawn from prior knowledge depend on a sparse group of attention heads located in the second half of the network. These heads are identified through activation patching and ablation across three model families. Removing them shifts most knowledge-based answers to visual ones, but barely affects visual answers. The result points to an asymmetric causal structure that explains when and how models prioritize one source of information over the other.

Core claim

Across three VLM families, visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales.

What carries the argument

A sparse set of 2.5-4.8% of attention heads in later layers that split into routing heads controlling information flow and writing heads injecting knowledge tokens into the residual stream.

If this is right

Ablating the heads shifts 68-96% of prior-knowledge answers to visual ones while affecting only 0.8-7.5% of visual answers.
The same heads and routing-writing decomposition appear in multiple VLM families and scales.
Prior grounding requires these specific heads; visual grounding does not.
The circuit explains the mechanism of override rather than distributed processing across the whole model.
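
The asymmetry in the first bullet can be made concrete with a small flip-rate calculation. This is a minimal sketch under stated assumptions: the function name `flip_rate`, the labels `"prior"` and `"visual"`, and the toy predictions are all hypothetical and are not the paper's code or data.

```python
from typing import Sequence


def flip_rate(before: Sequence[str], after: Sequence[str], from_label: str) -> float:
    """Fraction of examples labeled `from_label` before ablation whose
    label is different after ablation."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    eligible = [(b, a) for b, a in zip(before, after) if b == from_label]
    if not eligible:
        return 0.0
    changed = sum(1 for b, a in eligible if a != b)
    return changed / len(eligible)


# Hypothetical toy predictions (NOT results from the paper).
prior_before = ["prior"] * 10
prior_after = ["visual"] * 8 + ["prior"] * 2      # 8 of 10 flip to visual
visual_before = ["visual"] * 10
visual_after = ["visual"] * 9 + ["prior"]         # 1 of 10 changes

prior_flip = flip_rate(prior_before, prior_after, "prior")
visual_flip = flip_rate(visual_before, visual_after, "visual")
asymmetry = prior_flip - visual_flip
```

A large positive `asymmetry` is the pattern the summary describes: ablation disrupts knowledge-grounded answers far more often than visually grounded ones.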

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Targeted interventions on these heads could selectively increase or decrease reliance on world knowledge in deployed VLMs.
Similar sparse circuits may exist for other evidence conflicts, such as between text sources or between different sensory inputs.
Training procedures could be designed to strengthen or weaken these heads to improve model reliability under conflicting inputs.
The asymmetry suggests that default visual behavior might be harder to override without precise circuit-level edits.

Load-bearing premise

The prompts and conflict cases chosen for patching and ablation truly isolate perception-knowledge conflicts without artifacts from prompt design or dataset selection.

What would settle it

A new set of conflict examples in which ablating the reported heads fails to flip the majority of prior-knowledge answers to visual ones would falsify the claim of causal necessity.

Figures

Figures reproduced from arXiv: 2606.28273 by Carsten Eickhoff, Danielle Bitterman, Michal Golovanevsky, Niclas Lietzow, William Rudman.

Figure 2: Residual stream restoration scores Rd(ℓ) by layer for three representative models. P2V (dashed) and V2P (solid) patching directions are shown; the shaded region highlights the V2P–P2V asymmetry, and vertical dashed lines mark the critical window boundaries. Across models, V2P restoration rises earlier and more strongly than P2V, indicating that visual information is established before prior knowledge. Dif…

Figure 4: Flip rates under promoting-head group abla…

Figure 5: Image-attention fraction for classified heads under…

Figure 6: Residual stream restoration scores for all five models. P2V (dashed) and V2P (solid) patching directions.

Figure 7: Attention head classification scatters for all five models; the dashed line shows the PC1 axis used for the…

Figure 8: Residual-stream restoration scores under the visual-circuit contrast (vary image, hold prompt) for all five…

Figure 9: MLP restoration scores across layers for all five models. Effects are sparse, with only a few layers showing…

Figure 10: Image-attention fraction for all classified heads across five models, under…

Figure 11: Logit-lens hit rates on head-output differences for all classified heads across five models. Top-20 hit…

read the original abstract

Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally without a component-level causal account. We combine activation patching across three granularities (residual stream, attention heads, and MLP sublayers) with model-component ablation studies and mechanistic analysis. Across three VLM families, we find that visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge (e.g., "red" for a strawberry) despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales, revealing a sparse causal circuit underlying perception-knowledge conflict in VLMs.
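
To illustrate what head-level activation patching measures, here is a minimal sketch on a toy model. Everything in it is an assumption made for illustration: the toy "model" simply sums scalar head outputs, the names `forward` and `restoration_scores` are hypothetical, and the normalized score (patched minus corrupt, divided by clean minus corrupt) is one common convention, not necessarily the paper's definition of its restoration score.

```python
from typing import Callable, Dict, List, Optional

Inputs = Dict[str, float]
Head = Callable[[Inputs], float]


def forward(heads: List[Head], inputs: Inputs,
            patch: Optional[Dict[int, float]] = None) -> float:
    """Toy model: the output logit is the sum of all head outputs.
    `patch` maps a head index to a value that overrides that head's output."""
    patch = patch or {}
    return sum(patch[i] if i in patch else head(inputs)
               for i, head in enumerate(heads))


def restoration_scores(heads: List[Head], clean: Inputs, corrupt: Inputs) -> List[float]:
    """Patch each head's clean-run output into the corrupt run, one head at a
    time, and report how much of the clean-vs-corrupt output gap is restored."""
    clean_out = forward(heads, clean)
    corrupt_out = forward(heads, corrupt)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupt runs give the same output")
    scores = []
    for i, head in enumerate(heads):
        patched_out = forward(heads, corrupt, patch={i: head(clean)})
        scores.append((patched_out - corrupt_out) / gap)
    return scores


# Hypothetical toy heads: head 0 reads the image, heads 1 and 2 read the prompt.
toy_heads: List[Head] = [
    lambda x: 2.0 * x["image"],
    lambda x: 0.5 * x["prompt"],
    lambda x: 1.0 * x["prompt"],
]
clean_run = {"image": 1.0, "prompt": 1.0}     # same image, one prompt
corrupt_run = {"image": 1.0, "prompt": -1.0}  # same image, contrasting prompt
scores = restoration_scores(toy_heads, clean_run, corrupt_run)
```

Because only the prompt differs between the two runs, the image-reading head restores nothing, while the two prompt-reading heads split the restored gap in proportion to their weights. In a real VLM the same idea would be applied through hooks on attention-head outputs rather than a sum of scalars.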

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pins down a small set of later attention heads that causally enable prior knowledge to override visual input in VLMs, with clean asymmetry in the ablation results.


The main thing to know is that visual grounding wins by default in these models when vision and stored knowledge conflict, and ablating roughly 3-5% of attention heads in the second half of the network flips the output to visual answers in most cases under knowledge prompts while barely affecting visual ones.

The work does a few things solidly. It runs activation patching at residual, head, and MLP levels plus targeted ablations across three VLM families, reports consistent percentages, and decomposes the heads into routing and writing types. The asymmetry in flip rates (68-96% vs 0.8-7.5%) is the clearest new signal beyond earlier behavioral studies. That level of component-level intervention is useful for the subfield.

The soft spot is the prompt and case construction. The stress-test concern holds: if the prior-knowledge prompts or conflict examples are worded or selected in ways that already favor knowledge answers or weaken the visual signal, the heads will look more decisive than they are in neutral settings. Post-hoc head identification also needs tighter justification to rule out selection effects. The abstract numbers are specific, but without the full methods and error bars the causal claim stays provisional.

This is for people doing mechanistic work on multimodal models. It has enough intervention data and cross-model consistency to merit referee time rather than a desk reject, though the experimental controls will need scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper claims that vision-language models resolve perception-knowledge conflicts with visual grounding as the default behavior, while prior (knowledge-based) grounding depends on a sparse set of causally necessary attention heads (2.5-4.8% of total heads, concentrated in the second half of the network). These heads are decomposed into routing and writing types. Using activation patching at residual, head, and MLP levels plus ablation studies across three VLM families, the authors report that ablating the heads flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure.

Significance. If the results hold, the work supplies a component-level mechanistic account of how VLMs handle conflicting visual evidence and memorized knowledge, identifying a consistent sparse causal circuit. The empirical approach via multi-granularity patching and ablation, plus cross-family consistency, is a strength that could guide targeted interventions for more reliable multimodal outputs.

major comments (2)

[Methods / Experimental Setup (prompt and case selection)] The construction of prior-knowledge prompts and curation of conflict cases (introduced when defining the experimental conditions for patching and ablation) is load-bearing for the central causal claim. The manuscript must demonstrate that these prompts do not systematically increase the salience of knowledge answers or attenuate visual evidence, as any such artifact would make the 68-96% flip rates appear more diagnostic of the identified heads than they are in neutral settings.
[Results (head identification and ablation)] The identification of the specific 2.5-4.8% attention heads appears post-hoc from the patching results. The paper should specify the exact selection procedure (e.g., threshold, held-out data) and report per-model variance or statistical tests on the flip rates to rule out selection bias inflating the apparent necessity of these heads.
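
One way the second request could be met is to select heads on a discovery split and report flip rates with an interval on a held-out split. The sketch below is a hedged illustration, not the authors' procedure: `split_examples` and `wilson_interval` are hypothetical helper names, the 50/50 split and seed are arbitrary, and the flip count is invented. Only the example count of 73 is taken from the page (the smallest per-model count in the extracted Table 3 fragment).

```python
import math
import random
from typing import List, Sequence, Tuple


def split_examples(ids: Sequence[int], discovery_fraction: float = 0.5,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle example ids with a fixed seed and split them into a discovery
    set (for selecting heads) and a held-out set (for measuring flip rates)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * discovery_fraction)
    return shuffled[:cut], shuffled[cut:]


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


discovery, held_out = split_examples(range(73))

# Hypothetical: suppose 25 of the held-out examples flip after ablation.
hypothetical_flips = 25
low, high = wilson_interval(hypothetical_flips, len(held_out))
```

With only a few dozen held-out examples the interval is wide, which is why per-model counts and intervals matter for the smaller models.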

minor comments (2)

[Abstract] The abstract states ranges (68-96%, 2.5-4.8%) without per-family breakdowns or error bars; adding these would strengthen the quantitative claims.
[Mechanistic Analysis] The decomposition into 'routing heads' and 'writing heads' is introduced without explicit operational definitions or examples of their distinct effects on the residual stream.
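
The figure captions mention an image-attention fraction and logit-lens hit rates, which suggest how the two head types might be operationalized. The sketch below is an assumption-laden illustration rather than the paper's definitions: the function names, the toy attention row, and the three-token unembedding are all hypothetical.

```python
from typing import Dict, List, Sequence


def image_attention_fraction(attn_row: Sequence[float],
                             image_positions: Sequence[int]) -> float:
    """Share of one query's attention mass that lands on image-token positions.
    A routing-style head might show this share changing with the prompt."""
    total = sum(attn_row)
    if total == 0:
        raise ValueError("attention row sums to zero")
    return sum(attn_row[i] for i in image_positions) / total


def logit_lens_topk(head_output: Sequence[float],
                    unembed: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Project a head's output vector through an unembedding matrix and return
    the k highest-scoring tokens. A writing-style head would rank the answer
    token near the top."""
    logits = {tok: sum(h * w for h, w in zip(head_output, row))
              for tok, row in unembed.items()}
    return sorted(logits, key=logits.get, reverse=True)[:k]


# Hypothetical toy values.
attn_row = [0.1, 0.6, 0.2, 0.1]          # attention over 4 key positions
image_frac = image_attention_fraction(attn_row, image_positions=[1, 2])

toy_unembed = {"red": [1.0, 0.0], "blue": [0.0, 1.0], "green": [-1.0, 0.0]}
top_tokens = logit_lens_topk([0.9, 0.1], toy_unembed, k=1)
hit = "red" in top_tokens                # logit-lens "hit" for the prior answer
```

Reporting both quantities per head, with the thresholds used for classification, would give the two labels a checkable definition.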

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the presentation of our methods and results. We respond to each major comment below.


Referee: [Methods / Experimental Setup (prompt and case selection)] The construction of prior-knowledge prompts and curation of conflict cases (introduced when defining the experimental conditions for patching and ablation) is load-bearing for the central causal claim. The manuscript must demonstrate that these prompts do not systematically increase the salience of knowledge answers or attenuate visual evidence, as any such artifact would make the 68-96% flip rates appear more diagnostic of the identified heads than they are in neutral settings.

Authors: We agree that it is important to rule out potential artifacts in prompt construction. In the revised manuscript, we will include additional validation experiments and analyses demonstrating that our prior-knowledge prompts do not systematically increase the salience of knowledge-based answers relative to neutral settings. This will involve reporting response distributions on matched control prompts. revision: yes

Referee: [Results (head identification and ablation)] The identification of the specific 2.5-4.8% attention heads appears post-hoc from the patching results. The paper should specify the exact selection procedure (e.g., threshold, held-out data) and report per-model variance or statistical tests on the flip rates to rule out selection bias inflating the apparent necessity of these heads.

Authors: The head identification procedure is described in the methods, but we acknowledge the need for greater clarity on selection criteria to address concerns about post-hoc selection. We will revise the relevant sections to explicitly detail the selection procedure, including any thresholds or data splits used, and add per-model variance along with appropriate statistical tests for the reported flip rates. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical interventions are independent of inputs


The paper reports results from activation patching, ablation studies, and mechanistic analysis across VLM families. These are direct experimental interventions on model components (residual stream, attention heads, MLPs) whose outcomes (flip rates of 68-96% vs 0.8-7.5%) are measured observations rather than quantities derived from equations or parameters fitted to the same data. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claim of an asymmetric causal circuit is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that activation patching and ablation isolate causal roles without major side effects; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Activation patching and component ablation can isolate the causal roles of attention heads in resolving perception-knowledge conflicts without confounding effects from the intervention itself.
This assumption underpins the claim that the identified heads are causally necessary for prior grounding.
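
Whether an ablation confounds the result depends partly on what replaces the removed activation. The page does not say which variant the paper uses, so the sketch below simply contrasts two common choices, zero ablation and mean ablation, on toy scalar head outputs. The function name `ablate` and all values are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence


def ablate(head_outputs: Sequence[float], head_index: int, mode: str = "zero",
           reference: Optional[Sequence[Sequence[float]]] = None) -> List[float]:
    """Return a copy of one example's per-head outputs with one head replaced.

    mode="zero": replace the head's output with 0.0.
    mode="mean": replace it with that head's mean output over `reference`
                 examples, which keeps the activation closer to typical values.
    """
    out = list(head_outputs)
    if mode == "zero":
        out[head_index] = 0.0
    elif mode == "mean":
        if reference is None:
            raise ValueError("mean ablation needs reference examples")
        out[head_index] = mean(row[head_index] for row in reference)
    else:
        raise ValueError(f"unknown ablation mode: {mode}")
    return out


# Hypothetical per-head outputs for two examples (rows) and two heads (columns).
toy_dataset = [[1.0, 2.0], [3.0, 4.0]]
zeroed = ablate(toy_dataset[0], head_index=1, mode="zero")
averaged = ablate(toy_dataset[0], head_index=1, mode="mean", reference=toy_dataset)
```

If flip rates are similar under both variants, the necessity claim is less likely to be an artifact of pushing activations off-distribution.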



Reference graph

Works this paper leans on

23 extracted references · 8 linked inside Pith

[1]

Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti

Do VLMs have bad eyes? diagnosing compositional failures via mechanistic interpretability. Preprint, arXiv:2508.16652. Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, and Daniela Massiceti

arXiv
[2]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei

Understanding information storage and transfer in multi-modal large language models. Preprint, arXiv:2406.04236. Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei

arXiv
[3]

Preprint, arXiv:2407.14561

NNsight and NDIF: Democratizing access to open-weight foundation model internals. Preprint, arXiv:2407.14561. Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg

arXiv
[4]

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy

Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. Preprint, arXiv:2203.14680. Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy

arXiv
[5]

Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, and Carsten Eickhoff

Transformer feed-forward layers are key-value memories. Preprint, arXiv:2012.14913. Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, and Carsten Eickhoff. 2025a. Pixels versus priors: Controlling knowledge priors in vision-language models through visual counterfacts. Preprint, arXiv:2505.17127. Michal Golovanevsky, Will...

Pith/arXiv arXiv 2012
[6]

Preprint, arXiv:2404.05729

Finding visual task vectors. Preprint, arXiv:2404.05729. Tianze Hua, Tian Yun, and Ellie Pavlick

arXiv
[7]

Nick Jiang, Anish Kachinthaya, Suzie Petryk, and Yossi Gandelsman

How do vision-language models process conflicting information across modalities? Preprint, arXiv:2507.01790. Nick Jiang, Anish Kachinthaya, Suzie Petryk, and Yossi Gandelsman

arXiv
[8]

Preprint, arXiv:2410.02762

Interpreting and editing vision-language representations to mitigate hallucinations. Preprint, arXiv:2410.02762. Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao

arXiv
[9]

Omri Kaduri, Shai Bagon, and Tali Dekel

Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models. Preprint, arXiv:2402.18154. Omri Kaduri, Shai Bagon, and Tali Dekel

arXiv
[10]

Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, and Ranjay Krishna

What’s in the image? a deep-dive into the vision of vision language models. Preprint, arXiv:2411.17491. Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, and Ranjay Krishna

arXiv
[11]

Preprint, arXiv:2510.04819

Visual representations inside the language model. Preprint, arXiv:2510.04819. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee

arXiv
[12]

Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda

Locating and editing factual associations in GPT. Preprint, arXiv:2202.05262. Julian Minder, Clément Dumas, Caden Juang, Bilal Chughtai, and Neel Nanda

Pith/arXiv arXiv
[13]

Preprint, arXiv:2504.02922

Overcoming sparsity artifacts in crosscoders to interpret chat-tuning. Preprint, arXiv:2504.02922. Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez

arXiv
[14]

Yaniv Nikankin, Dana Arad, Yossi Gandelsman, and Yonatan Belinkov

Towards interpreting visual information processing in vision-language models. Preprint, arXiv:2410.07149. Yaniv Nikankin, Dana Arad, Yossi Gandelsman, and Yonatan Belinkov

arXiv
[15]

Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, and Kurt Stockinger

Same task, different circuits: Disentangling modality-specific mechanisms in VLMs. Preprint, arXiv:2506.09047. Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, and Kurt Stockinger

arXiv
[16]

Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang

When seeing overrides knowing: Disentangling knowledge conflicts in vision-language models. Preprint, arXiv:2507.13868. Vedant Palit, Rohan Pandey, Aryaman Arora, and Paul Pu Liang

Pith/arXiv arXiv
[17]

Mechanisms of prompt-induced hallucination in vision-language models. arXiv preprint arXiv:2601.05201. Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard,...

Pith/arXiv arXiv
[18]

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber

PaliGemma 2: A family of versatile VLMs for transfer. Preprint, arXiv:2412.03555. Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber

Pith/arXiv arXiv
[19]

Causal mediation analysis for interpreting neural NLP: The case of gender bias. Preprint, arXiv:2004.12265. Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin

arXiv 2004
[20]

Qidong Wang, Junjie Hu, and Ming Jiang

Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. Preprint, arXiv:2409.12191. Qidong Wang, Junjie Hu, and Ming Jiang

Pith/arXiv arXiv
[21]

Fred Zhang and Neel Nanda

V-SEAM: Visual semantic editing and attention modulating for causal interpretability of vision-language models. Preprint, arXiv:2509.14837. Fred Zhang and Neel Nanda

Pith/arXiv arXiv
[22]

Towards best practices of activation patching in language models: Metrics and methods. Preprint, arXiv:2309.16042.

Table 3: Number of correctly conflicting examples per model, used for all quantitative analyses.

Model            Total   Correct conflict
Qwen-VL 3B       467     73
Qwen-VL 7B       467     212
LLaVA-NeXT 7B    467     80
PaliGemma 3B     467     121
PaliGemma 10B    467     177

Model V2P P...

Pith/arXiv arXiv
[23]

Preprint, arXiv:2511.02243

When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in MLLMs. Preprint, arXiv:2511.02243.

A Dataset and Example Selection

The Visual-Counterfact dataset (Golovanevsky et al., 2025a) contains 469 examples of common objects with digitally recolored images. Each example pairs an object (e.g., banana, elephant) with it...

arXiv
