pith. machine review for the scientific record.

arxiv: 2605.10253 · v1 · submitted 2026-05-11 · 💻 cs.CR · cs.AI


Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

Haoran Zheng, Jiajun Liu, Peiru Yang, Shangguang Wang, Shiting Wang, Tao Qi, Tong Ju, Wanchun Ni, Yongfeng Huang


Pith reviewed 2026-05-12 05:23 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI

keywords knowledge poisoning · medical RAG · multimodal retrieval · adversarial attacks · retrieval-augmented generation · LLM security · medical AI · visual perturbations

The pith

Medical multimodal RAG systems can be poisoned by covert text misinformation paired with imperceptible visual perturbations that manipulate retrieval without any knowledge of user queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation improves medical LLMs by pulling in external knowledge, yet the source databases can be tampered with to produce wrong answers. Earlier attacks required advance knowledge of the exact question asked, an assumption that fails in real deployments. M³Att works from only rough knowledge of the database contents, adding tiny, hard-to-see changes to images that raise the chance the system will fetch the bad content. The text is written with deliberate medical ambiguity so the errors survive the model's normal fact-checking. Experiments across five LLMs and several datasets show that the attack yields generations that look clinically reasonable but are factually incorrect.

Core claim

M³Att is a knowledge-poisoning framework for medical multimodal RAG that assumes only limited distribution knowledge of the database; it injects covert misinformation into textual entries while using imperceptible perturbations on paired visual data as a query-agnostic trigger to alter retrieval probabilities, thereby producing clinically plausible yet incorrect model outputs that evade LLM self-correction.

What carries the argument

The M³Att framework, which combines a unified visual perturbation technique that shifts retrieval probabilities toward poisoned items with a covert misinformation injection method that exploits the inherent ambiguity of medical diagnosis to evade automatic correction.
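
A minimal sketch of that perturbation step, read editorially from the abstract: projected gradient descent under an L∞ budget pushes a poisoned image's embedding toward a cluster centroid of the database's embedding space, so the entry surfaces for any query near that cluster. The surrogate encode_image, the centroid, and the step sizes are illustrative assumptions, not the authors' code.

import torch

def craft_trigger(image, centroid, encode_image, eps=8/255, alpha=2/255, steps=40):
    """PGD with an L-infinity budget: nudge the poisoned image so its
    embedding drifts toward a centroid of expected query embeddings,
    raising retrieval probability without knowing any specific query."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = encode_image(image + delta)
        # gradient ascent on cosine similarity to the centroid
        loss = torch.nn.functional.cosine_similarity(emb, centroid, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # signed ascent step
            delta.clamp_(-eps, eps)              # keep the change imperceptible
            delta.copy_((image + delta).clamp(0, 1) - image)  # valid pixel range
        delta.grad.zero_()
    return (image + delta).detach()

The figures sweep exactly these knobs (perturbation intensity from 2/255 to 50/255, cluster counts from 5 to 40), which is what the eps and centroid choices above stand in for.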

If this is right

  • Databases for medical RAG can be compromised using only distribution-level knowledge and visual triggers that require no query information.
  • LLM self-correction fails against carefully ambiguous medical misinformation, allowing plausible incorrect generations.
  • The attack produces consistent results across five different LLMs and multiple medical datasets.
  • Retrieval manipulation via visual perturbations can bypass standard safeguards in multimodal medical systems.
  • Protecting medical RAG requires defenses against both database poisoning and generation-stage evasion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trigger-and-ambiguity pattern could be tested in non-medical RAG settings that also contain domain-specific uncertainty.
  • Hospitals or clinics using image-based retrieval might add checks for small visual alterations as a practical countermeasure.
  • One direct test would be to measure whether adding perturbation detectors at the retrieval stage blocks the attack; one such detector is sketched after this list.
  • The work points to a general need for robustness measures in any multimodal retrieval system that handles ambiguous expert content.
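
The detector idea flagged in the list above could take a feature-squeezing form: compare an image's embedding before and after mild JPEG recompression, since fine-grained L∞ adversarial noise tends not to survive compression. This is an editorial sketch with assumed interfaces (encode_image, preprocess) and an illustrative threshold, not a defense the paper tests.

import io
import torch
from PIL import Image

def embedding_shift(pil_img, encode_image, preprocess, quality=75):
    buf = io.BytesIO()
    pil_img.save(buf, format="JPEG", quality=quality)  # squeeze out fine-grained noise
    squeezed = Image.open(io.BytesIO(buf.getvalue()))
    with torch.no_grad():
        e1 = encode_image(preprocess(pil_img).unsqueeze(0))
        e2 = encode_image(preprocess(squeezed).unsqueeze(0))
    return 1 - torch.nn.functional.cosine_similarity(e1, e2).item()

def looks_perturbed(pil_img, encode_image, preprocess, threshold=0.05):
    # clean images shift little under recompression; adversarially
    # perturbed ones tend to move sharply in embedding space
    return embedding_shift(pil_img, encode_image, preprocess) > threshold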

Load-bearing premise

Limited knowledge of the database distribution together with imperceptible visual changes is enough to reliably steer retrieval and keep the LLM from spotting the planted medical errors.

What would settle it

An experiment on a medical multimodal RAG system in which the perturbed images produce no measurable rise in retrieval probability for the poisoned entries, or in which the LLM output overrides the planted misinformation and returns the correct diagnosis.
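
A minimal harness for the retrieval half of that settling experiment, assuming a hypothetical retrieve(image, text, k) helper that returns top-k database entry ids and a known id set for the injected entries:

def retrieval_asr(queries, poisoned_ids, retrieve, k=5):
    """Fraction of queries whose top-k retrieval contains any poisoned
    entry (the ASR@Top-k the paper's figures report)."""
    hits = sum(
        any(doc_id in poisoned_ids for doc_id in retrieve(q.image, q.text, k))
        for q in queries
    )
    return hits / len(queries)

# The premise fails if ASR over the perturbed corpus is indistinguishable
# from ASR over the same entries left unperturbed:
# asr_perturbed = retrieval_asr(queries, poisoned_ids, retrieve_perturbed)
# asr_clean = retrieval_asr(queries, poisoned_ids, retrieve_clean)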

Figures

Figures reproduced from arXiv: 2605.10253 by Haoran Zheng, Jiajun Liu, Peiru Yang, Shangguang Wang, Shiting Wang, Tao Qi, Tong Ju, Wanchun Ni, Yongfeng Huang.

Figure 1. Overview of the M³Att framework for poisoning medical multimodal RAG services.
Figure 2. Retrieval hijacking success (ASR@Top-k) of M³Att under black-box and white-box settings. (Panels plot Retrieval ASR@Top-5 against poison ratio 0.001–0.080, perturbation intensity 2/255–50/255, and cluster number 5–40, on IU-XRay, MIMIC, CRC100K, MHIST, and PCam.)
Figure 3. Hyperparameter analysis of M³Att, including poison rate, PGD perturbation intensity, and cluster numbers.
Figure 5. Robustness of M³Att against potential defense methods, confirming the stealthiness of its poisoned data.
Figure 6. Case study of M³Att.
Figure 7. t-SNE visualization of three selected clusters and corresponding poisoned samples across three retrievers.
Figure 8. Extended hyperparameter analysis of the proposed poisoning attack.
Figure 9. System prompt used for clinical ambiguity-guided poisoning.
Figure 10. Examples of clinical ambiguity-guided poisoning.
Original abstract

Retrieval-augmented generation (RAG) is a widely adopted paradigm for enhancing LLMs in medical applications by incorporating expert multimodal knowledge during generation. However, the underlying retrieval databases may naturally contain, or be intentionally injected with, adversarial knowledge, which can perturb model outputs and undermine system reliability. To investigate this risk, prior studies have explored knowledge poisoning attacks in medical RAG systems. Nevertheless, most of them rely on the strong assumption that adversaries possess prior knowledge of user queries, which is unrealistic in deployments and substantially limits their practical applicability. In this paper, we propose M³Att, a knowledge-poisoning framework designed for medical multimodal RAG systems, assuming only limited distribution knowledge of the underlying database. Our core idea is to inject covert misinformation into textual data while using paired visual data as a query-agnostic trigger to promote retrieval. We first propose a unified framework that introduces imperceptible perturbations to visual inputs to manipulate retrieval probabilities. Besides, due to the prior medical knowledge in LLMs, naively poisoned medical content with explicit factual errors can be corrected during generation. Thus, we leverage the inherent ambiguity of medical diagnosis and design a covert misinformation injection strategy that degrades diagnostic accuracy while evading model self-correction. Experiments on five LLMs and datasets demonstrate that M³Att consistently produces clinically plausible yet incorrect generations. Codes: https://github.com/ypr17/M3Att.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes M³Att, a knowledge-poisoning framework for medical multimodal RAG systems that assumes only limited distribution knowledge of the retrieval database. It injects covert, ambiguous misinformation into textual entries while applying imperceptible gradient-based perturbations to paired visual data as a query-agnostic trigger to increase the probability that poisoned content is retrieved. The authors argue that this evades LLM self-correction due to medical diagnostic ambiguity and report that experiments across five LLMs and datasets produce clinically plausible yet incorrect generations.

Significance. If the central empirical claims are supported by rigorous quantitative evaluation and robustness checks, the work would be significant for demonstrating practical poisoning risks in medical RAG under weaker adversary assumptions than prior query-dependent attacks. It could motivate defenses such as retrieval verification or ambiguity-aware generation safeguards. The absence of reported metrics, baselines, and transfer experiments in the provided abstract, however, makes it difficult to gauge the result's reliability or generalizability at present.

major comments (3)
  1. [Abstract] Abstract: the claim that M³Att 'consistently produces clinically plausible yet incorrect generations' across five LLMs is presented without any quantitative metrics (e.g., attack success rate, retrieval hit rate, diagnostic accuracy drop), baselines, statistical tests, or description of how clinical plausibility was measured or validated. This is load-bearing for the central claim and prevents assessment of whether the data actually support reliable attack success.
  2. [Attack Framework and Experiments] Attack framework and experimental evaluation: the visual perturbation trigger relies on the surrogate encoder producing embeddings sufficiently aligned with the target medical VLM so that small L_p-bounded changes transfer. No cross-encoder ablation or transfer experiments are described to test this under encoder mismatch, which directly challenges the 'limited distribution knowledge' assumption in realistic deployments where the deployed VLM may differ from the attacker's surrogate. (One concrete form such a transfer experiment could take is sketched after this list.)
  3. [Covert Misinformation Injection and Evaluation] Covert misinformation strategy: the paper motivates the ambiguous poisoning approach by noting that explicit errors are corrected by LLMs' prior medical knowledge, yet the evaluation does not report quantitative self-correction rates on the poisoned ambiguous statements (e.g., as a function of perturbation strength or ambiguity level). Without this, it is unclear whether observed incorrect generations result from successful retrieval poisoning or from weak LLM priors on the chosen datasets.
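
One concrete shape for the transfer experiment comment 2 asks for: craft triggers against each surrogate encoder, evaluate retrieval ASR under each deployed encoder, and read the off-diagonal cells. Here craft and evaluate are assumed helpers, and the encoder families are placeholders (CLIP-, SigLIP-, or MedCLIP-style retrievers), not a protocol the paper specifies.

def transfer_matrix(surrogates, targets, craft, evaluate):
    """ASR[(surrogate, target)] for perturbations optimized on one
    encoder and deployed against another."""
    results = {}
    for s_name, s_enc in surrogates.items():
        poisoned = craft(s_enc)  # triggers optimized on the surrogate
        for t_name, t_enc in targets.items():
            # off-diagonal cells are the encoder-mismatch case; a steep
            # drop there would undercut the limited-knowledge threat model
            results[(s_name, t_name)] = evaluate(poisoned, t_enc)
    return results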

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity of our quantitative claims and the robustness of our experimental design. We address each major comment point by point below and commit to revisions that strengthen the presentation without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that M³Att 'consistently produces clinically plausible yet incorrect generations' across five LLMs is presented without any quantitative metrics (e.g., attack success rate, retrieval hit rate, diagnostic accuracy drop), baselines, statistical tests, or description of how clinical plausibility was measured or validated. This is load-bearing for the central claim and prevents assessment of whether the data actually support reliable attack success.

    Authors: We agree that the abstract would benefit from explicit quantitative anchors to support the central claim. The full manuscript provides these details in Section 4, including attack success rates, retrieval hit rates, diagnostic accuracy drops, comparison baselines, and statistical significance testing. Clinical plausibility was assessed via blinded expert review by medical professionals on sampled generations. We will revise the abstract to include representative metrics and a concise description of the evaluation protocol for clinical plausibility. revision: yes

  2. Referee: [Attack Framework and Experiments] Attack framework and experimental evaluation: the visual perturbation trigger relies on the surrogate encoder producing embeddings sufficiently aligned with the target medical VLM so that small L_p-bounded changes transfer. No cross-encoder ablation or transfer experiments are described to test this under encoder mismatch, which directly challenges the 'limited distribution knowledge' assumption in realistic deployments where the deployed VLM may differ from the attacker's surrogate.

    Authors: This observation correctly identifies a potential limitation in validating the transferability of the query-agnostic visual trigger. Our current setup uses a surrogate aligned with the target distribution to reflect the limited-knowledge adversary model. To directly address encoder mismatch, we will add cross-encoder ablation studies in the revised experimental section, evaluating perturbation transfer across distinct medical VLMs and reporting the resulting changes in retrieval probabilities and end-to-end attack success. revision: yes

  3. Referee: [Covert Misinformation Injection and Evaluation] Covert misinformation strategy: the paper motivates the ambiguous poisoning approach by noting that explicit errors are corrected by LLMs' prior medical knowledge, yet the evaluation does not report quantitative self-correction rates on the poisoned ambiguous statements (e.g., as a function of perturbation strength or ambiguity level). Without this, it is unclear whether observed incorrect generations result from successful retrieval poisoning or from weak LLM priors on the chosen datasets.

    Authors: We appreciate this point on isolating the source of incorrect generations. Our evaluation isolates the poisoning effect by comparing outputs under clean versus poisoned retrieval sets, showing that explicit errors are frequently self-corrected while ambiguous statements produce clinically plausible errors. We will revise the manuscript to include quantitative self-correction rates for ambiguous versus explicit misinformation, reported as a function of both perturbation strength and expert-rated diagnostic ambiguity levels (one form such a measurement could take is sketched after these responses). revision: yes
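
A sketch of the self-correction measurement promised in response 3, with generate (the RAG pipeline's answer step) and contradicts_error (an LLM- or expert-based judge) as assumed interfaces rather than the paper's code:

def self_correction_rate(cases, generate, contradicts_error):
    """Fraction of poisoned cases where the generator's answer overrides
    the planted error rather than repeating it."""
    corrected = 0
    for case in cases:  # each case: image, poisoned context, planted error
        answer = generate(case.image, case.poisoned_context)
        if contradicts_error(answer, case.planted_error):
            corrected += 1
    return corrected / len(cases)

Run separately on explicit versus ambiguity-guided errors and across perturbation strengths, the gap between the two rates is what would isolate retrieval poisoning from weak LLM priors.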

Circularity Check

0 steps flagged

Empirical attack construction with no derivation chain or self-referential reductions

Full rationale

The paper introduces M³Att as an empirical framework for knowledge poisoning in medical multimodal RAG, relying on visual perturbations as triggers and covert misinformation strategies, then evaluates it through experiments on five LLMs and datasets. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs or self-definitions by construction. The central claims rest on experimental outcomes rather than any closed mathematical loop, and the provided text contains no load-bearing self-citations or ansatzes that would trigger circularity patterns. This is a standard empirical security paper whose validity hinges on experimental design, not tautological derivations.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full paper may contain additional parameters for perturbation generation and misinformation crafting.

free parameters (2)
  • visual perturbation strength
    Controls imperceptibility versus retrieval manipulation effectiveness; value not specified in abstract.
  • misinformation ambiguity level
    Determines how covert the textual errors are while still degrading accuracy.
axioms (1)
  • domain assumption: Medical LLMs possess prior knowledge sufficient to correct explicit factual errors but not ambiguous diagnostic misinformation.
    Invoked to justify why naive poisoning fails and covert injection is needed.
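
Read as an experiment plan, the ledger's knobs line up with the sweep axes the figures report; the grid below restates those reported ranges, with the ambiguity levels as illustrative placeholders rather than values from the paper.

SWEEP = {
    "poison_ratio": [0.001, 0.02, 0.04, 0.06, 0.08],                      # fraction of corpus poisoned
    "perturbation_eps": [2/255, 10/255, 20/255, 30/255, 40/255, 50/255],  # L-inf visual budget
    "num_clusters": [5, 8, 10, 20, 40],                                   # query-distribution clusters
    "ambiguity_level": ["explicit", "hedged", "ambiguous"],               # hypothetical covertness tiers
}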

pith-pipeline@v0.9.0 · 5582 in / 1209 out tokens · 28544 ms · 2026-05-12T05:23:49.254601+00:00 · methodology



Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 5 internal anchors

  1. [1]

    Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  2. [4]

    Visual instruction tuning. Advances in Neural Information Processing Systems.

  3. [5]

    Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems.

  4. [6]

    MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

  5. [7]

    Universal Vision-Language Dense Retrieval: Learning a Unified Representation Space for Multi-Modal Retrieval. The Eleventh International Conference on Learning Representations.

  6. [8]

    UniIR: Training and benchmarking universal multimodal information retrievers. European Conference on Computer Vision, 2024.

  7. [9]

    A survey of multimodal retrieval-augmented generation. arXiv preprint arXiv:2504.08748.

  8. [10]

    HM-RAG: Hierarchical multi-agent multimodal retrieval augmented generation. Proceedings of the 33rd ACM International Conference on Multimedia.

  9. [11]

    A generalist vision-language foundation model for diverse biomedical tasks. Nature Medicine, 2024.

  10. [12]

    LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems.

  11. [13]

    MedCLIP: Contrastive learning from unpaired medical images and text. Proceedings of the Conference on Empirical Methods in Natural Language Processing.

  12. [14]

    MedVLM-R1: Incentivizing medical reasoning capability of vision-language models (VLMs) via reinforcement learning. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2025.

  13. [15]

    In-context learning enables multimodal large language models to classify cancer pathology images. Nature Communications, 2024.

  14. [16]

    MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models. The Thirteenth International Conference on Learning Representations.

  15. [17]

    RULE: Reliable multimodal RAG for factuality in medical vision language models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  16. [22]

    Glue pizza and eat rocks: Exploiting vulnerabilities in retrieval-augmented generative models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  17. [23]

    PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models. 34th USENIX Security Symposium (USENIX Security 25).

  18. [24]

    Images are Achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. European Conference on Computer Vision, 2024.

  19. [29]

    Medical large language models are vulnerable to data-poisoning attacks. Nature Medicine, 2025.

  20. [30]

    Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 2015.

  21. [31]

    MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 2019.

  22. [32]

    Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS Medicine, 2019.

  23. [33]

    A petri dish for histopathology image analysis. International Conference on Artificial Intelligence in Medicine, 2021.

  24. [34]

    Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA, 2017.

  25. [36]

    OpenAI GPT-5 System Card. 2025.

  26. [37]

    Anthropic. 2025.

  27. [38]

    Some methods of classification and analysis of multivariate observations. Proc. of 5th Berkeley Symposium on Math. Stat. and Prob.

  28. [39]

    Towards Deep Learning Models Resistant to Adversarial Attacks. International Conference on Learning Representations.

  29. [40]

    MegaPairs: Massive data synthesis for universal multimodal retrieval. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  30. [41]

    Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.

  31. [42]

    Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  32. [43]

    Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

  33. [44]

    Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997.

  34. [45]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 2025.

  35. [47]

    Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems. The Thirteenth International Conference on Learning Representations.

  36. [48]

    The good and the bad: Exploring privacy issues in retrieval-augmented generation (RAG). Findings of the Association for Computational Linguistics: ACL 2024.

  37. [50]

    Daniel Alexander Alber, Zihao Yang, Anton Alyakin, Eunice Yang, Sumedha Rai, Aly A Valliani, Jeff Zhang, Gabriel R Rosenbaum, Ashley K Amend-Thomas, David B Kurland, and 1 others. 2025. Medical large language models are vulnerable to data-poisoning attacks. Nature Medicine, pages 1--9

  38. [51]

    Shakiba Amirshahi, Amin Bigdeli, Charles LA Clarke, and Amira Ghenai. 2025. Evaluating the robustness of retrieval-augmented generation to adversarial evidence in the health domain. arXiv preprint arXiv:2509.03787

  39. [52]

    Anthropic. 2025. Claude haiku 4.5 system card. https://assets.anthropic.com/m/99128ddd009bdcb/Claude-Haiku-4-5-System-Card.pdf. Accessed: 2025-11-22

  40. [53]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, and 1 others. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923

  41. [54]

    Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen AWM Van Der Laak, Meyke Hermsen, Quirine F Manson, Maschenka Balkenhol, and 1 others. 2017. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22):2199--2210

  42. [55]

    Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William Cohen. 2022. Murag: Multimodal retrieval-augmented generator for open question answering over images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5558--5570

  43. [56]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

  44. [57]

    Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. 2015. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304--310

  45. [58]

    Dyke Ferber, Georg Wölflein, Isabella C Wiest, Marta Ligero, Srividhya Sainath, Narmin Ghaffari Laleh, Omar SM El Nahhas, Gustav Müller-Franzes, Dirk Jäger, Daniel Truhn, and 1 others. 2024. In-context learning enables multimodal large language models to classify cancer pathology images. Nature Communications, 15(1):10104

  46. [59]

    Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, and Heng Ji. 2025. Mm-poisonrag: Disrupting multimodal rag with local and global poisoning attacks. arXiv preprint arXiv:2502.17832

  47. [60]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  48. [61]

    Changyue Jiang, Xudong Pan, Geng Hong, Chenfu Bao, and Min Yang. 2024. Rag-thief: Scalable extraction of private data from retrieval-augmented generation applications with agent-based attacks. arXiv preprint arXiv:2411.14110

  49. [62]

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. 2019. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317

  50. [63]

    Jakob Nikolas Kather, Johannes Krisam, Pornpimol Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron Weis, Timo Gaiser, Alexander Marx, Nektarios A Valous, Dyke Ferber, and 1 others. 2019. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine, 16(1):e1002730

  51. [64]

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2023. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541--28564

  52. [65]

    Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen. 2024. Images are achilles' heel of alignment: Exploiting visual vulnerabilities for jailbreaking multimodal large language models. In European Conference on Computer Vision, pages 174--189. Springer

  53. [66]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  54. [67]

    Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, and Jun Ma. 2025a. Hm-rag: Hierarchical multi-agent multimodal retrieval augmented generation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 2781--2790

  55. [68]

    Yinuo Liu, Zenghui Yuan, Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2025b. Poisoned-mrag: Knowledge poisoning attacks to multimodal retrieval augmented generation. arXiv preprint arXiv:2503.06254

  56. [69]

    Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, and Ge Yu. 2023b. Universal vision-language dense retrieval: Learning a unified representation space for multi-modal retrieval. In The Eleventh International Conference on Learning Representations

  57. [70]

    Linyin Luo, Yujuan Ding, Yunshan Ma, Wenqi Fan, and Hanjiang Lai. 2025. Hv-attack: Hierarchical visual attack for multimodal retrieval augmented generation. arXiv preprint arXiv:2511.15435

  58. [71]

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations

  59. [72]

    Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, and 1 others. 2025. Towards trustworthy retrieval augmented generation for large language models: A survey. arXiv preprint arXiv:2502.06872

  60. [73]

    OpenAI. 2025. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267

  61. [74]

    Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. 2025. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 337--347. Springer

  62. [75]

    Zhenting Qi, Hanlin Zhang, Eric P Xing, Sham M Kakade, and Himabindu Lakkaraju. 2025. Follow my instruction and spill the beans: Scalable data extraction from retrieval-augmented generation systems. In The Thirteenth International Conference on Learning Representations

  63. [76]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and 1 others. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748--8763. PMLR

  64. [77]

    Yingjia Shang, Yi Liu, Huimin Wang, Furong Li, Wenfang Sun, Wu Chengyu, and Yefeng Zheng. 2025. Medusa: Cross-modal transferable adversarial attacks on multimodal medical retrieval-augmented generation. arXiv preprint arXiv:2511.19257

  65. [78]

    Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Song Wang, Jundong Li, Tianlong Chen, and Huan Liu. 2024. Glue pizza and eat rocks-exploiting vulnerabilities in retrieval-augmented generative models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1610--1626

  66. [79]

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. 2024. Uniir: Training and benchmarking universal multimodal information retrievers. In European Conference on Computer Vision, pages 387--404. Springer

  67. [80]

    Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Naofumi Tomita, Lorenzo Torresani, and 1 others. 2021. A petri dish for histopathology image analysis. In International Conference on Artificial Intelligence in Medicine, pages 11--24. Springer

  68. [81]

    Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, and Huaxiu Yao. 2025. Mmed-rag: Versatile multimodal rag system for medical vision language models. In The Thirteenth International Conference on Learning Representations

  69. [82]

    Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, and Huaxiu Yao. 2024. Rule: Reliable multimodal rag for factuality in medical vision language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1081--1093

  70. [83]

    Xun Xian, Ganghua Wang, Xuan Bi, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, and Jie Ding. 2024. On the vulnerability of applying retrieval-augmented generation within knowledge-intensive application domains. arXiv preprint arXiv:2409.17275

  71. [84]

    Lei Yu, Yechao Zhang, Ziqi Zhou, Yang Wu, Wei Wan, Minghui Li, Shengshan Hu, Pei Xiaobing, and Jing Wang. 2025. Spa-vlm: Stealthy poisoning attacks on rag-based vlm. arXiv preprint arXiv:2505.23828

  72. [85]

    Shenglai Zeng, Jiankun Zhang, Pengfei He, Yiding Liu, Yue Xing, Han Xu, Jie Ren, Yi Chang, Shuaiqiang Wang, Dawei Yin, and 1 others. 2024. The good and the bad: Exploring privacy issues in retrieval-augmented generation (rag). In Findings of the Association for Computational Linguistics: ACL 2024, pages 4505--4524

  73. [86]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975--11986

  74. [87]

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024a. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence

  75. [88]

    Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun Chen, Brian D Davison, Hui Ren, and 1 others. 2024b. A generalist vision-language foundation model for diverse biomedical tasks. Nature Medicine, 30(11):3129--3141

  76. [89]

    Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. 2025. Megapairs: Massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19076--19095

  77. [90]

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. In 34th USENIX Security Symposium (USENIX Security 25), pages 3827--3844

  78. [91]

    Kaiwen Zuo, Zelin Liu, Raman Dutt, Ziyang Wang, Zhongtian Sun, Yeming Wang, Fan Mo, and Pietro Liò. 2025. How to make medical AI systems safer? Simulating vulnerabilities and threats in multimodal medical RAG system. arXiv preprint arXiv:2508.17215