pith. machine review for the scientific record.

arxiv: 2605.05594 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.CV · cs.LG

Recognition: unknown

The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

Hoin Jung, Xiaoqian Wang

Pith reviewed 2026-05-08 11:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.CV · cs.LG
keywords recorruption · multimodal RAG · attentional collapse · visual blindness · positional bias · oracle context · inference-time intervention · textual bias

The pith

Even perfectly accurate context can make multimodal models abandon correct answers by suppressing image attention and fixating on text boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that retrieval-augmented generation in multimodal models introduces a failure mode called recorruption, where oracle-level external documents cause capable systems to drop an initially right prediction. This occurs through a specific two-part breakdown in how the model attends: it largely stops looking at the visual input while also defaulting to the start and end tokens of the text rather than its meaning. The authors trace this to patterns in the model's internal attention matrices and demonstrate that many apparent RAG wins are only lucky alignments with positional habits. To counter it they introduce an inference-only method that boosts visual focus and down-weights boundary tokens, improving results across medical, fairness, and location tasks without any retraining.

Core claim

The central claim is that recorruption arises even when the added context is perfectly accurate, because internal attention undergoes a two-fold collapse: visual blindness that reduces both the total mass and the sharpness of attention to image tokens, plus a structural positional bias that elevates boundary tokens over semantically relevant ones, creating an illusion of success whenever textual copying happens to match the ground-truth location.
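
To make the diagnosis concrete, here is a minimal sketch of how visual attention mass and sharpness could be computed from one layer's attention weights. The entropy-based sharpness and the averaging over heads and queries are illustrative assumptions; the paper's exact definitions of M_vis and S_vis may differ.

```python
import numpy as np

def visual_attention_stats(attn, vis_idx):
    """Diagnostic stats for one layer's attention.

    attn:    (num_heads, q_len, k_len) softmax attention weights.
    vis_idx: key positions of image (visual) tokens.

    Returns (M_vis, S_vis) under one plausible reading: mass is the
    average attention assigned to visual keys, sharpness is 1 minus the
    normalized entropy of attention restricted to visual keys.
    """
    head_mean = attn.mean(axis=0)                 # (q_len, k_len)
    vis = head_mean[:, vis_idx]                   # attention to image tokens
    m_vis = vis.sum(axis=-1).mean()               # mean visual mass per query

    p = vis / (vis.sum(axis=-1, keepdims=True) + 1e-12)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    s_vis = 1.0 - (entropy / np.log(len(vis_idx))).mean()
    return float(m_vis), float(s_vis)
```

Visual blindness, on this reading, shows up as a simultaneous drop in both returned values once retrieved text is appended to the prompt.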

What carries the argument

The two-fold attentional collapse: visual blindness (suppressed M_vis and S_vis) and structural positional bias in the attention matrices. BAIR (Bottleneck Attention Intervention for Recovery) counters both at inference time by restoring visual saliency and imposing position-aware penalties on textual distractors.
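
A minimal sketch of what such an inference-time edit could look like on one layer's pre-softmax attention scores. The fixed `temp`, `alpha`, and `penalty` values are placeholders; BAIR itself derives its sharpening temperature by bisection and a mass-restoring shift (see Figure 6 below), which this sketch does not reproduce.

```python
import numpy as np

def bair_like_intervention(logits, vis_idx, boundary_idx,
                           temp=0.7, alpha=1.0, penalty=2.0):
    """Attention edit in the spirit of BAIR (illustrative only).

    logits:       (q_len, k_len) pre-softmax attention scores at one layer.
    vis_idx:      key positions of image tokens.
    boundary_idx: key positions of start/end-of-document boundary tokens.
    """
    edited = logits.copy()
    edited[:, vis_idx] = edited[:, vis_idx] / temp + alpha  # sharpen and boost image tokens
    edited[:, boundary_idx] -= penalty                      # damp boundary-token copying
    edited -= edited.max(axis=-1, keepdims=True)            # stable softmax
    attn = np.exp(edited)
    return attn / attn.sum(axis=-1, keepdims=True)
```

Because the edit touches only attention scores at inference time, no weights change, which is what makes the method parameter-free and retraining-free.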

If this is right

  • Adding external documents can reduce multimodal grounding even when those documents contain the correct information.
  • Apparent correctness in RAG outputs often stems from coincidental alignment with the model's bias toward copying boundary tokens.
  • A parameter-free attention intervention at inference time can recover visual focus and raise accuracy on factuality, fairness, and geospatial tasks.
  • Textual bias in attention matrices creates a hidden cost that scales with the amount of retrieved context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar attention patterns may appear in non-multimodal RAG systems when long contexts are introduced, suggesting the positional bias is not limited to images.
  • The same intervention approach could be tested on other modalities such as audio or video to check whether boundary bias generalizes.
  • If attention collapse is causal, then future model architectures might need explicit safeguards against context-induced visual suppression rather than relying on scale alone.

Load-bearing premise

Observed shifts in attention mass and sharpness are the direct cause of the model changing its output rather than a side effect that would persist even if attention were restored.

What would settle it

A controlled test in which the attention patterns are artificially restored to pre-context levels yet the model still switches away from its original correct answer would disprove the claimed causal role of the collapse.
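
In code, the decisive step of such a test is small: during the RAG run, overwrite the attention assigned to visual keys with the pre-context pattern, renormalize, and check whether the answer still flips. A sketch under the assumption that visual tokens occupy the same key positions in both runs and that the comparison is made at the same decoding step; wiring this into a real model would require forward hooks at the diagnosed bottleneck layer.

```python
import numpy as np

def patch_visual_attention(attn_rag, attn_clean, vis_idx):
    """attn_rag / attn_clean: 1-D attention distributions over keys at
    the same decoding step, with and without oracle context.
    vis_idx: visual-token key positions, assumed shared across runs."""
    patched = attn_rag.copy()
    patched[vis_idx] = attn_clean[vis_idx]   # restore pre-context visual attention
    return patched / patched.sum()           # renormalize to a distribution
```

If the model still abandons its original answer under this patch, the causal reading of the collapse is undermined; if the answer is recovered, the causal story gains direct support.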

Figures

Figures reproduced from arXiv: 2605.05594 by Hoin Jung, Xiaoqian Wang.

Figure 1
Figure 1: Qualitative examples of recorruption and our proposed cure. In the medical diagnosis (Left), the social fairness (Center), and geospatial domain (Right), the model correctly identifies the visual evidence in the baseline without external context. The introduction of Oracle context causes the model to ignore the image and generate hallucinated text (recorruption). Our proposed BAIR method successfully cures… view at source ↗
Figure 2
Figure 2: Sankey diagrams illustrating the recorruption phenomenon for medical MLLMs (MedGemma-4B, CheXagent-8B). A portion of initially correct visual predictions (Without Retrieval) are corrupted into incorrect predictions upon the introduction of Oracle Retrieval context. To safely harness the benefits of retrieval without compromising the model's visual perception or textual grounding, we propose Bottleneck Att… view at source ↗
Figure 3
Figure 3: (a) Visual Attention Degradation: The introduction of textual context results in a systemic drop in Visual Attention Mass (M_vis) and Sharpness (S_vis) across architectures (MedGemma-4B and Qwen2.5-VL-7B). (b) Comparison of Success and Recorruption Profiles: Attention metrics for successful RAG outcomes and recorruption failures are statistically indistinguishable (Qwen2.5-VL-7B). (c) Textual Profile Analys… view at source ↗
Figure 4
Figure 4: Illustration of Text-Induced Visual Suppression and Recovery via BAIR. (Left) In the No RAG setting, the model maintains focused attention on the relevant visual evidence. (Right) When retrieved textual context is introduced, standard RAG suffers from visual suppression and textual positional bias. BAIR mitigates this failure by restoring visually grounded attention while reducing the dominance of distract… view at source ↗
Figure 5
Figure 5: Impact of the BAIR intervention on multimodal RAG pipelines. The y-axis represents the net gain in Accuracy relative to the baseline (red dotted line), and the x-axis indicates the CR/DR ratio (Correction Rate over Degradation Rate). Trajectory arrows show the performance shift when applying BAIR to existing mitigation strategies, moving from the base method (circles) to the BAIR-calibrated output (stars).… view at source ↗
Figure 6
Figure 6: Efficiency of the bisection search on the monotonic S_vis(T) function. Sharpness is a strictly increasing function of the temperature scalar T, ensuring a unique root for any S_boost ∈ [0, 1]. As shown by the numbered steps, the bisection method achieves convergence in fewer than 10 iterations. This targeted intervention at the pre-filling bottleneck layer allows BAIR to restore ground-truth visual clarity w… view at source ↗
Figure 7
Figure 7: Ablation study on MedGemma-4B on the IU-Chest dataset. (The remainder of the extracted caption is spilled appendix D.1 text describing the shared IU-Chest question and instruction templates.) … view at source ↗
Figure 8
Figure 8: Attention Profile Analysis. We compare the final layer bottleneck attention profiles of Baseline, MS-PoE, MAD-RAG, and BAIR. The visual token region is shaded, and the third retrieved document is marked as the ground truth document. The upper panels show robust normalized attention curves, while the lower panels visualize each method's attention change relative to the Baseline.… view at source ↗
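
The bisection described in Figure 6 is easy to reconstruct under one assumption: treat T as an inverse-temperature multiplier on the visual logits, so that sharpness increases monotonically in T. The entropy-based sharpness below is an illustrative stand-in for the paper's S_vis, not its published formula.

```python
import numpy as np

def sharpness(logits_vis, t):
    """S_vis proxy: 1 minus normalized entropy of softmax(t * logits).
    Monotonically increasing in t (larger t -> sharper distribution)."""
    scaled = t * logits_vis
    p = np.exp(scaled - scaled.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - entropy / np.log(len(p))

def solve_temperature(logits_vis, s_boost, lo=1e-3, hi=100.0, iters=10):
    """Bisection for T with sharpness(T) ~= s_boost. Monotonicity gives a
    unique root, reached in ~10 halvings as Figure 6 reports."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sharpness(logits_vis, mid) < s_boost:
            lo = mid   # still too diffuse: raise the inverse temperature
        else:
            hi = mid   # sharp enough: lower it
    return 0.5 * (lo + hi)
```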
read the original abstract

While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multimodal LLMs in RAG settings exhibit 'recorruption,' where even perfectly accurate oracle context causes abandonment of initially correct predictions. It attributes this via mechanistic attention analysis to a two-fold collapse—visual blindness (suppressed M_vis and S_vis) and structural positional bias—and identifies an 'Illusion of Success' where correct outputs arise from positional coincidence aligning with copying bias. It proposes BAIR, a parameter-free inference-time intervention restoring visual saliency and applying position penalties, reporting improvements on medical factuality, social fairness, and geospatial benchmarks.

Significance. If the mechanistic account and causal efficacy of BAIR hold, the work would highlight under-appreciated instance-level failure modes in MLLM-RAG that evade standard metrics, while offering a practical, training-free mitigation. The attention-based diagnosis and identification of positional biases contribute to understanding transformer pathologies in multimodal settings; the parameter-free design and cross-domain benchmarks strengthen applicability to reliability-critical domains.

major comments (3)
  1. [Mechanistic diagnosis section] The central claim that recorruption is 'driven by' the two-fold attentional collapse (visual blindness via suppressed M_vis/S_vis and positional bias) is supported only by observational before/after attention matrix comparisons after oracle insertion. No controlled causal test—such as targeted attention editing, head ablation, or isolation from FFN/value vector shifts—is reported to rule out alternative mechanisms.
  2. [BAIR evaluation (Tables 1-3)] While recovery is reported, the paper lacks a controlled ablation isolating the visual saliency restoration component from the position-penalty machinery, leaving open whether gains stem from the hypothesized mechanisms or unmeasured side effects on output distributions.
  3. [Benchmark results sections] Reported improvements lack error bars, number of runs, statistical significance tests, or explicit data exclusion rules, which is load-bearing for claims of diagnostic reliability gains across medical, fairness, and geospatial tasks.
minor comments (2)
  1. [Abstract and §1] The terms 'recorruption' and 'BAIR' are used before full definition, which could be clarified with a brief inline gloss for accessibility.
  2. [Attention matrix figures] Axes, color scales for M_vis/S_vis, and token boundary annotations could be labeled more explicitly to aid interpretation of the collapse and bias effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the rigor of our mechanistic claims and experimental reporting. We address each major comment below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: [Mechanistic diagnosis section] The central claim that recorruption is 'driven by' the two-fold attentional collapse (visual blindness via suppressed M_vis/S_vis and positional bias) is supported only by observational before/after attention matrix comparisons after oracle insertion. No controlled causal test—such as targeted attention editing, head ablation, or isolation from FFN/value vector shifts—is reported to rule out alternative mechanisms.

    Authors: We appreciate the referee's emphasis on establishing causality. Our current analysis is observational, relying on systematic before-and-after comparisons of attention matrices following oracle context insertion, which consistently reveal the described patterns of visual suppression and positional bias. While we did not include targeted interventions such as head ablation or attention editing in this work, the BAIR framework itself constitutes an inference-time intervention that directly modifies the identified attention components, with the resulting performance recovery providing indirect support for their causal role. In the revised manuscript, we will add an explicit limitations subsection acknowledging the observational nature of the diagnosis and outlining directions for future causal experiments (e.g., targeted editing) to further validate the mechanisms. revision: partial

  2. Referee: [BAIR evaluation (Tables 1-3)] While recovery is reported, the paper lacks a controlled ablation isolating the visual saliency restoration component from the position-penalty machinery, leaving open whether gains stem from the hypothesized mechanisms or unmeasured side effects on output distributions.

    Authors: We agree that isolating the contributions of each BAIR component is essential for confirming the hypothesized mechanisms. In the revised manuscript, we will include a controlled ablation study that evaluates the visual saliency restoration and position-penalty components both separately and in combination. This will report their individual effects on the medical, fairness, and geospatial benchmarks, allowing readers to assess whether the observed gains derive from the intended attention interventions rather than unintended distributional shifts. revision: yes

  3. Referee: [Benchmark results sections] Reported improvements lack error bars, number of runs, statistical significance tests, or explicit data exclusion rules, which is load-bearing for claims of diagnostic reliability gains across medical, fairness, and geospatial tasks.

    Authors: We acknowledge the importance of statistical transparency for the reliability claims. We will revise the experimental sections to report error bars computed across multiple independent runs (specifying the exact number of runs and random seeds), include statistical significance tests (e.g., paired t-tests with p-values), and explicitly document all data exclusion criteria, preprocessing steps, and evaluation protocols. These additions will be integrated into Tables 1-3 and the corresponding result descriptions. revision: yes
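
For reference, the promised significance testing can be this simple: a paired test over per-seed accuracies for baseline RAG versus RAG with BAIR. The accuracy values below are placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed accuracies (placeholders, not paper results).
baseline  = np.array([0.712, 0.705, 0.719, 0.708, 0.714])
with_bair = np.array([0.741, 0.733, 0.748, 0.736, 0.744])

t_stat, p_value = stats.ttest_rel(with_bair, baseline)  # paired t-test across seeds
print(f"mean gain = {(with_bair - baseline).mean():.3f}, p = {p_value:.4f}")
```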

Circularity Check

0 steps flagged

No circularity; the empirical attention analysis and the parameter-free intervention stand independently, with neither presupposing the other's conclusions.

full rationale

The paper identifies recorruption via direct mechanistic inspection of attention matrices before and after oracle context insertion, characterizing visual blindness and positional bias as observed patterns. BAIR is introduced as an inference-time, parameter-free intervention that restores saliency and applies penalties without any fitted parameters, self-citations, or derivations that reduce the claimed effects to the input observations by construction. No equations, uniqueness theorems, or ansatzes are invoked that would create self-definitional or load-bearing circular steps. The derivation chain remains self-contained: empirical diagnosis plus independent mitigation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that internal attention matrices are diagnostic of prediction errors and that targeted intervention on them restores grounding without side effects.

axioms (1)
  • domain assumption Changes in visual attention mass and sharpness directly cause the observed drop in prediction accuracy when context is added.
    Invoked to link mechanistic diagnosis to the recorruption phenomenon.
invented entities (2)
  • recorruption no independent evidence
    purpose: Name the specific failure mode where accurate context induces errors.
    Newly defined term in the paper.
  • BAIR no independent evidence
    purpose: Inference-time attention intervention framework.
    Newly proposed method.

pith-pipeline@v0.9.0 · 5529 in / 1197 out tokens · 38913 ms · 2026-05-08T11:14:22.439348+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    B. An, S. Zhang, and M. Dredze. RAG LLMs are not safer: A safety analysis of retrieval-augmented generation for large language models. arXiv preprint arXiv:2504.18041, 2025.

  2. [2]

    Z. Chen, M. Varma, J.-B. Delbrouck, M. Paschali, L. Blankemeier, D. Van Veen, J. M. J. Valanarasu, A. Youssef, J. P. Cohen, E. P. Reis, et al. CheXagent: Towards a foundation model for chest X-ray interpretation. In AAAI 2024 Spring Symposium on Clinical Foundation Models, 2024.

  3. [3]

    G. Cheng, J. Han, and X. Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.

  4. [4]

    J.-B. Delbrouck, J. Xu, J. Moll, A. Thomas, Z. Chen, S. Ostmeier, A. Azhar, K. Z. Li, A. Johnston, C. Bluethgen, et al. Automated structured radiology report generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26813–26829, 2025.

  5. [5]

    D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, 2015.

  6. [6]

    S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.

  7. [7]

    L. Gustafson, C. Rolland, N. Ravi, Q. Duval, A. Adcock, C.-Y. Fu, M. Hall, and C. Ross. FACET: Fairness in computer vision evaluation benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20370–20382, 2023.

  8. [8]

    C.-Y. Hsieh, Y.-S. Chuang, C.-L. Li, Z. Wang, L. Le, A. Kumar, J. Glass, A. Ratner, C.-Y. Lee, R. Krishna, et al. Found in the middle: Calibrating positional attention bias improves long context utilization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14982–14995, 2024.

  9. [9]

    H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024.

  10. [10]

    E. Kortukov, A. Rubinstein, E. Nguyen, and S. J. Oh. Studying large language model behaviors under context-memory conflicts with real documents. arXiv preprint arXiv:2404.16032, 2024.

  11. [11]

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  12. [12]

    C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.

  13. [13]

    F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.

  14. [14]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.

  15. [15]

    H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

  16. [16]

    J. Luo, Z. Pang, Y. Zhang, T. Wang, L. Wang, B. Dang, J. Lao, J. Wang, J. Chen, Y. Tan, et al. SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding. arXiv preprint arXiv:2406.10100, 2024.

  17. [17]

    M. K. Mandanetwork, H. E. Rekik, and O. Bouaziz. Enhancing technical knowledge acquisition with RAG systems: The TEI use case. In Texts, Languages and Communities – TEI 2024, 2024.

  18. [18]

    K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.

  19. [19]

    M. Raja, E. Yuvaraajan, et al. A RAG-based medical assistant especially for infectious diseases. In 2024 International Conference on Inventive Computation Technologies (ICICT), pages 1128–1133. IEEE, 2024.

  20. [20]

    A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025.

  21. [21]

    S. Soni, A. Dudhane, H. Debary, M. Fiaz, M. A. Munir, M. S. Danish, P. Fraccaro, C. D. Watson, L. J. Klein, F. S. Khan, et al. EarthDial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025.

  22. [22]

    H. Wadhwa, R. Seetharaman, S. Aggarwal, R. Ghosh, S. Basu, S. Srinivasan, W. Zhao, S. Chaudhari, and E. Aghazadeh. From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries. arXiv preprint arXiv:2406.12824, 2024.

  23. [23]

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  24. [24]

    Z. Wang, H. Zhang, X. Li, K.-H. Huang, C. Han, S. Ji, S. M. Kakade, H. Peng, and H. Ji. Eliminating position bias of language models: A mechanistic approach. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=fvkElsJOsN.

  25. [25]

    N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch. CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In International Conference on Case-Based Reasoning, pages 445–460. Springer, 2024.

  26. [26]

    X. Wu, Y. Wang, S. Jegelka, and A. Jadbabaie. On the emergence of position bias in transformers. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 67756–67781. PMLR, ...

  27. [27]

    J. Yao, S. Liu, Y. Wang, L. Mei, B. Bi, Y. Ge, Z. Li, and X. Cheng. Who is in the spotlight: The hidden bias undermining multimodal retrieval-augmented generation. arXiv preprint arXiv:2506.11063, 2025.

  28. [28]

    Z. Zhang, R. Chen, S. Liu, Z. Yao, O. Ruwase, B. Chen, X. Wu, and Z. Wang. Found in the middle: How language models use long contexts better via plug-and-play positional encoding. Advances in Neural Information Processing Systems, 37:60755–60775, 2024.

  29. [29]

    B. Zhao, W. Deng, X. Liao, Y. Li, N. Shaikh, Y. Nie, and X. Li. When RAG hurts: Diagnosing and mitigating attention distraction in retrieval-augmented LVLMs. arXiv preprint arXiv:2602.00344, 2026.