Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Changick Kim; Hyebin Cho; Jaehyuk Jang; Joon Son Chung; Suho Yoo

arxiv: 2606.18924 · v1 · pith:Z3SGKYMQnew · submitted 2026-06-17 · 💻 cs.SD

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

Hyebin Cho , Suho Yoo , Jaehyuk Jang , Changick Kim , Joon Son Chung This is my paper

Pith reviewed 2026-06-26 19:34 UTC · model grok-4.3

classification 💻 cs.SD

keywords audio large language modelstext dominancemechanistic interpretabilitymultimodal biasback-patchinglayer-wise activation analysisconflict resolution

0 comments

The pith

Text pathways in Audio LLMs actively suppress intact audio representations instead of erasing them when inputs conflict.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces how internal activations move through layers in Audio LLMs to show that text and audio follow separate routes until they meet in late layers. It finds that the text route suppresses audio signals that remain otherwise complete. The authors introduce back-patching, which moves late-layer audio activations into earlier layers, as a training-free way to strengthen audio signals and cut text dominance. This matters because the bias produces hallucinations on conflicting inputs, and the intervention offers a direct way to shift the balance without retraining the model.

Core claim

Text dominance is systematic across models. Text and audio use functionally distinct pathways that converge into a shared semantic space in late layers. The text pathway does not erase audio information but actively suppresses intact audio representations. Back-patching routes late-layer audio activations back into earlier layers, amplifying the audio representations so they overcome textual suppression and reduce the bias.

What carries the argument

Back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers to amplify audio representations against textual suppression.

If this is right

Text dominance appears reliably across different Audio LLMs when audio and text contradict.
Audio and text inputs travel separate functional pathways until they merge in late layers.
Back-patching consistently lowers the rate of text dominance on conflicting inputs.
The approach opens a route to mechanistic multimodal alignment without additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same suppression pattern may appear in other multimodal models that combine language with another modality.
Back-patching could be tested on real-world inputs where audio evidence directly contradicts text instructions.
Identifying the exact layers where suppression begins could allow more targeted interventions.

Load-bearing premise

That tracing representation propagation and applying back-patching reveals the causal mechanism of text dominance rather than a side effect of the chosen models and prompts.

What would settle it

A test in which back-patching is applied across multiple conflicting audio-text pairs yet the rate of text dominance stays the same or rises, or in which audio representations are shown to be overwritten before any suppression occurs.

Figures

Figures reproduced from arXiv: 2606.18924 by Changick Kim, Hyebin Cho, Jaehyuk Jang, Joon Son Chung, Suho Yoo.

**Figure 1.** Figure 1: Overview of our approach to analyzing and mitigating text dominance in ALMs. (a) An example of modality conflict where the model ignores acoustic evidence and generates a text-dominant answer. (b) Identification of the individual text and audio circuits and the analysis of their functional relationship. (c) Targeted ablations reveal that removing the text circuit restores the model’s reliance on audio cues… view at source ↗

**Figure 2.** Figure 2: Causal tracing of model components via AP-IG. We visualize the patching effects from all model components across each input position and layer using Qwen on the negation swap conflict task. Distinct operational behaviors are observed across different functional segments, corresponding to the specific positions of Data, Query, and Generation tokens. the effect is weaker and less consistent; in one conditio… view at source ↗

**Figure 3.** Figure 3: Faithfulness of discovered circuits. We evaluate circuit faithfulness across models and English conflict tasks. The smallest circuit with faithfulness above 0.8 is used for subsequent analysis. those from the corresponding counterfactual run. We sweep over circuit sparsity levels and select the smallest percentile whose normalized faithfulness exceeds 0.8. As illustrated in the Faithfulness Graphs (fig.… view at source ↗

**Figure 5.** Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt template example for the English Number Swap task applied to Qwen, illustrating a multimodal setup where text data precedes audio data. The prompt design is adapted from ALME (Billa, 2026), and a structurally similar template is employed for the Ultravox model. remains predominantly positive across the multilingual spectrum. However, the magnitude of this bias varies depending on the specific confl… view at source ↗

**Figure 7.** Figure 7: Effect of iterative back-patching on audio accuracy. The results illustrate that applying the intervention iteratively does not guarantee continuous improvements. In most cases, a single back-patching iteration yields the maximum performance gain, while subsequent iterations result in diminishing returns or slight regressions. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Intervention heatmaps across a linguistically diverse subset of languages. The plots illustrate the back-patching results for Arabic (AR), German (DE), English (EN), and French (FR). Despite minor topological variations, the overarching trend consistently highlights significant performance improvements when routing representations from the late layers back to the early layers. 18 [PITH_FULL_IMAGE:figures/… view at source ↗

read the original abstract

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper traces text dominance in audio LLMs across layers and tests back-patching as a fix, but the evidence that it specifically counters active suppression rather than just boosting audio signals remains thin.

read the letter

The core contribution is the layer-wise tracing that shows text and audio paths stay distinct until late layers, where text appears to suppress rather than erase audio representations. They then route late audio activations back to earlier layers via back-patching and report reduced text dominance on conflicting inputs. That combination of tracing plus a training-free intervention is the part that feels new relative to prior hallucination work.

The tracing itself is a reasonable first step for this subfield. Finding that audio information persists but gets overridden gives a clearer picture than just observing output bias, and the back-patching idea is simple enough that others could try it quickly.

The main weakness is the causal link. The claim that text actively suppresses intact audio rests on the tracing and the intervention effect, yet nothing in the abstract rules out the possibility that back-patching simply increases the relative weight of audio signals at the right time, regardless of any suppression mechanism. A control that perturbs non-suppressive paths or directly checks whether audio representations remain unchanged after text arrives would tighten this. Without those, the results could be consistent with several different stories about how the conflict is resolved.

The work is aimed at people already working on multimodal interpretability and audio LLM reliability. It is narrow enough that it will mainly interest that group rather than a broader audience. The idea is worth referee time because the problem is practical and the proposed fix is cheap to test, even if the current evidence for the mechanism is still preliminary. I would send it out for review and ask for the additional controls on the suppression claim.

Referee Report

2 major / 1 minor

Summary. The manuscript conducts the first mechanistic interpretability analysis of text dominance in Audio LLMs under conflicting audio-text inputs. By tracing representation propagation across layers, it reports three findings: (i) text dominance is systematic across models, (ii) text and audio pathways are distinct but converge in late-layer semantic space, and (iii) text actively suppresses rather than erases intact audio representations. It introduces back-patching, a training-free intervention routing late-layer audio activations backward to amplify audio signals and overcome suppression, with evaluation claiming consistent bias reduction.

Significance. If the causal claims hold, the work supplies concrete mechanistic evidence for multimodal bias and a simple intervention for mitigation, advancing interpretability beyond correlational observations. The layer-tracing plus intervention approach offers a template for testing suppression hypotheses in other multimodal settings.

major comments (2)

[Findings (iii) and evaluation of back-patching] Findings (iii) and back-patching evaluation: the claim that audio representations remain intact after text input and are actively suppressed is not directly tested; activation tracing alone does not distinguish suppression from other forms of interference or overwriting, and no measurement (e.g., probe accuracy on audio features post-text) confirms intactness.
[Evaluation of back-patching] Back-patching results: the reported reduction in text dominance could arise from any early-layer amplification of audio signals rather than specifically countering a suppression mechanism. No control condition (e.g., equivalent-magnitude perturbations to non-suppressive pathways or random activations) is described to isolate the effect.

minor comments (1)

[Abstract] Abstract: the sentence 'text dominance is systematically and empirically across models' is grammatically incomplete and should be revised for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important gaps in directly testing the suppression claim and in controlling for the specificity of back-patching. We address each point below and commit to revisions that strengthen the evidence without altering the core findings.

read point-by-point responses

Referee: [Findings (iii) and evaluation of back-patching] Findings (iii) and back-patching evaluation: the claim that audio representations remain intact after text input and are actively suppressed is not directly tested; activation tracing alone does not distinguish suppression from other forms of interference or overwriting, and no measurement (e.g., probe accuracy on audio features post-text) confirms intactness.

Authors: We agree that layer-wise activation tracing alone provides indirect rather than direct evidence for intact audio representations. Our tracing demonstrates persistence of audio-specific signals into late layers under text conflict, which we interpret as suppression rather than erasure; however, this does not rule out alternative interference mechanisms. In the revised version we will add linear probe experiments measuring audio feature classification accuracy from late-layer activations both with and without conflicting text input. These probes will directly test intactness and help differentiate suppression from overwriting or other forms of interference. revision: yes
Referee: [Evaluation of back-patching] Back-patching results: the reported reduction in text dominance could arise from any early-layer amplification of audio signals rather than specifically countering a suppression mechanism. No control condition (e.g., equivalent-magnitude perturbations to non-suppressive pathways or random activations) is described to isolate the effect.

Authors: The back-patching procedure is motivated by our tracing results showing late-layer convergence and selective suppression of audio pathways. Nevertheless, we acknowledge that the current evaluation lacks controls to isolate the mechanism. In revision we will include two control conditions: (1) injecting equivalent-magnitude random noise activations into the same early layers, and (2) applying matched perturbations to non-audio semantic pathways identified in our analysis. These controls will test whether bias reduction is specific to routing intact late-layer audio signals or arises from generic amplification. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical activation tracing and intervention

full rationale

The paper's core claims rest on tracing internal representations across layers in Audio LLMs to identify three empirical findings about text dominance, followed by a training-free back-patching intervention. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described chain. The derivation is self-contained via experimental observations rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; no free parameters, new entities, or explicit axioms listed. Analysis rests on the unstated domain assumption that activation tracing reveals causal mechanisms.

axioms (1)

domain assumption Activation tracing and back-patching can identify and alter causal pathways responsible for observed model behavior
Entire analysis and proposed intervention depend on this standard assumption in mechanistic interpretability.

pith-pipeline@v0.9.1-grok · 5724 in / 1211 out tokens · 31304 ms · 2026-06-26T19:34:55.051202+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 1 canonical work pages

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

Nikankin, Yaniv and Arad, Dana and Gandelsman, Yossi and Belinkov, Yonatan , booktitle=nips, year=
[9]

Transformer feed-forward layers are key-value memories , author=
[10]

Arithmetic without algorithms: Language models solve math with a bag of heuristics , author=
[11]

Axiomatic attribution for deep networks , author=
[12]

Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms , author=
[13]

2022 , url=

Attribution Patching: Activation Patching at Industrial Scale , author=. 2022 , url=

2022
[14]

2022 , url =

Nanda, Neel and Bloom, Joseph , title =. 2022 , url =

2022
[15]

A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models , author=
[16]

Locating and editing factual associations in gpt , author=
[17]

Interpretability in the wild: a circuit for indirect object identification in gpt-2 small , author=
[18]

Towards automated circuit discovery for mechanistic interpretability , author=
[19]

Causal analysis of syntactic agreement mechanisms in neural language models , author=
[20]

Towards vision-language mechanistic interpretability: A causal tracing tool for blip , author=
[21]

What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation , author=
[22]

Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition , author=
[23]

arXiv preprint arXiv:2605.12225 , year=

Mechanistic Interpretability of ASR models using Sparse Autoencoders , author=. arXiv preprint arXiv:2605.12225 , year=

Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2502.17516 , year=

A survey on mechanistic interpretability for multi-modal foundation models , author=. arXiv preprint arXiv:2502.17516 , year=

arXiv
[25]

Probing Cross-modal Information Hubs in Audio-Visual LLMs , author=
[26]

arXiv preprint arXiv:2602.11488 , year=

When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration , author=. arXiv preprint arXiv:2602.11488 , year=

arXiv
[27]

arXiv preprint arXiv:2407.10759 , year=

Qwen2-audio technical report , author=. arXiv preprint arXiv:2407.10759 , year=

Pith/arXiv arXiv
[28]

2024 , url =

Ultravox: A fast multimodal. 2024 , url =

2024
[29]

arXiv preprint arXiv:2508.10552 , year=

When language overrules: Revealing text dominance in multimodal large language models , author=. arXiv preprint arXiv:2508.10552 , year=

arXiv
[30]

Muchomusic: Evaluating music understanding in multimodal audio-language models , author=
[31]

What's in the Image? A Deep-Dive into the Vision of Vision Language Models , author=
[32]

Hopping too late: Exploring the limitations of large language models on multi-hop queries , author=
[33]

Racing thoughts: Explaining contextualization errors in large language models , author=
[34]

arXiv preprint arXiv:2603.06854 , year=

Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering , author=. arXiv preprint arXiv:2603.06854 , year=

arXiv
[35]

AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs , author=
[36]

arXiv preprint arXiv:2603.13768 , year=

Causal Tracing of Audio-Text Fusion in Large Audio Language Models , author=. arXiv preprint arXiv:2603.13768 , year=

arXiv
[37]

Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities , author=
[38]

Beyond single-audio: Advancing multi-audio processing in audio large language models , author=
[39]

Evaluating robustness of large audio language models to audio injection: An empirical study , author=
[40]

Soundmind: Rl-incentivized logic reasoning for audio-language models , author=
[41]

In-the-wild Audio Spatialization with Flexible Text-guided Localization , author=
[42]

Listen, think, and understand , author=
[43]

Learning to See through Sound: From VggCaps to Multi2Cap for Richer Automated Audio Captioning , author=
[44]

Audio entailment: Assessing deductive reasoning for audio understanding , author=
[45]

Can large language models understand spatial audio? , author=
[46]

Hurst, Aaron and Lerer, Adam and Goucher, Adam P and Perelman, Adam and Ramesh, Aditya and Clark, Aidan and Ostrow, AJ and Welihinda, Akila and Hayes, Alan and Radford, Alec and others , journal=
[47]

arXiv preprint arXiv:2605.16403 , year=

When Vision Speaks for Sound , author=. arXiv preprint arXiv:2605.16403 , year=

Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2604.16902 , year=

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models , author=. arXiv preprint arXiv:2604.16902 , year=

Pith/arXiv arXiv
[49]

Avhbench: A cross-modal hallucination benchmark for audio-visual large language models , author=
[50]

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu and Tae-Hyun Oh , title =

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[2] [2]

Publications Manual , year = "1983", publisher =

1983

[3] [3]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[5] [5]

Dan Gusfield , title =. 1997

1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

Nikankin, Yaniv and Arad, Dana and Gandelsman, Yossi and Belinkov, Yonatan , booktitle=nips, year=

[9] [9]

Transformer feed-forward layers are key-value memories , author=

[10] [10]

Arithmetic without algorithms: Language models solve math with a bag of heuristics , author=

[11] [11]

Axiomatic attribution for deep networks , author=

[12] [12]

Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms , author=

[13] [13]

2022 , url=

Attribution Patching: Activation Patching at Industrial Scale , author=. 2022 , url=

2022

[14] [14]

2022 , url =

Nanda, Neel and Bloom, Joseph , title =. 2022 , url =

2022

[15] [15]

A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models , author=

[16] [16]

Locating and editing factual associations in gpt , author=

[17] [17]

Interpretability in the wild: a circuit for indirect object identification in gpt-2 small , author=

[18] [18]

Towards automated circuit discovery for mechanistic interpretability , author=

[19] [19]

Causal analysis of syntactic agreement mechanisms in neural language models , author=

[20] [20]

Towards vision-language mechanistic interpretability: A causal tracing tool for blip , author=

[21] [21]

What do vlms notice? a mechanistic interpretability pipeline for gaussian-noise-free text-image corruption and evaluation , author=

[22] [22]

Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition , author=

[23] [23]

arXiv preprint arXiv:2605.12225 , year=

Mechanistic Interpretability of ASR models using Sparse Autoencoders , author=. arXiv preprint arXiv:2605.12225 , year=

Pith/arXiv arXiv

[24] [24]

arXiv preprint arXiv:2502.17516 , year=

A survey on mechanistic interpretability for multi-modal foundation models , author=. arXiv preprint arXiv:2502.17516 , year=

arXiv

[25] [25]

Probing Cross-modal Information Hubs in Audio-Visual LLMs , author=

[26] [26]

arXiv preprint arXiv:2602.11488 , year=

When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration , author=. arXiv preprint arXiv:2602.11488 , year=

arXiv

[27] [27]

arXiv preprint arXiv:2407.10759 , year=

Qwen2-audio technical report , author=. arXiv preprint arXiv:2407.10759 , year=

Pith/arXiv arXiv

[28] [28]

2024 , url =

Ultravox: A fast multimodal. 2024 , url =

2024

[29] [29]

arXiv preprint arXiv:2508.10552 , year=

When language overrules: Revealing text dominance in multimodal large language models , author=. arXiv preprint arXiv:2508.10552 , year=

arXiv

[30] [30]

Muchomusic: Evaluating music understanding in multimodal audio-language models , author=

[31] [31]

What's in the Image? A Deep-Dive into the Vision of Vision Language Models , author=

[32] [32]

Hopping too late: Exploring the limitations of large language models on multi-hop queries , author=

[33] [33]

Racing thoughts: Explaining contextualization errors in large language models , author=

[34] [34]

arXiv preprint arXiv:2603.06854 , year=

Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering , author=. arXiv preprint arXiv:2603.06854 , year=

arXiv

[35] [35]

AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs , author=

[36] [36]

arXiv preprint arXiv:2603.13768 , year=

Causal Tracing of Audio-Text Fusion in Large Audio Language Models , author=. arXiv preprint arXiv:2603.13768 , year=

arXiv

[37] [37]

Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities , author=

[38] [38]

Beyond single-audio: Advancing multi-audio processing in audio large language models , author=

[39] [39]

Evaluating robustness of large audio language models to audio injection: An empirical study , author=

[40] [40]

Soundmind: Rl-incentivized logic reasoning for audio-language models , author=

[41] [41]

In-the-wild Audio Spatialization with Flexible Text-guided Localization , author=

[42] [42]

Listen, think, and understand , author=

[43] [43]

Learning to See through Sound: From VggCaps to Multi2Cap for Richer Automated Audio Captioning , author=

[44] [44]

Audio entailment: Assessing deductive reasoning for audio understanding , author=

[45] [45]

Can large language models understand spatial audio? , author=

[46] [46]

Hurst, Aaron and Lerer, Adam and Goucher, Adam P and Perelman, Adam and Ramesh, Aditya and Clark, Aidan and Ostrow, AJ and Welihinda, Akila and Hayes, Alan and Radford, Alec and others , journal=

[47] [47]

arXiv preprint arXiv:2605.16403 , year=

When Vision Speaks for Sound , author=. arXiv preprint arXiv:2605.16403 , year=

Pith/arXiv arXiv

[48] [48]

arXiv preprint arXiv:2604.16902 , year=

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models , author=. arXiv preprint arXiv:2604.16902 , year=

Pith/arXiv arXiv

[49] [49]

Avhbench: A cross-modal hallucination benchmark for audio-visual large language models , author=

[50] [50]

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu and Tae-Hyun Oh , title =