Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering
Pith reviewed 2026-05-07 07:32 UTC · model grok-4.3
The pith
Retrieving full page images from medical papers and reasoning over them iteratively improves question-answering accuracy over text-chunk methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MED-VRAG is an iterative multimodal retrieval-augmented generation system that operates directly on PMC document page images. It indexes pages with patch-level embeddings for fast retrieval, applies a sharded MapReduce filter, and lets a vision-language model refine queries while building evidence in a memory bank. The method records higher accuracy on medical question-answering tasks than text-centric baselines, with ablations attributing separate gains to image-based retrieval, iteration, and memory accumulation.
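To make the filtering stage concrete, here is a minimal sketch of a sharded MapReduce LLM filter, assuming a hypothetical llm_select() wrapper around the paper's Qwen3-30B-A3B filter model; the targets (k=25 per map shard, 100 after the reduce) follow the paper's appendix prompt, while shard_size is an illustrative choice.

```python
# Minimal sketch of a sharded MapReduce LLM filter. llm_select() is a
# hypothetical wrapper: given a question, candidate summaries, and a target
# count, it returns the indices of the selected candidates.
from typing import Callable

def mapreduce_filter(
    question: str,
    page_summaries: list[str],          # summaries of Stage-1 candidate pages
    llm_select: Callable[[str, list[str], int], list[int]],
    shard_size: int = 200,              # illustrative, not from the paper
    map_k: int = 25,                    # per-shard target from appendix prompt
    reduce_k: int = 100,                # final target from appendix prompt
) -> list[int]:
    """Map shards each keep map_k pages; one reduce call merges to reduce_k."""
    survivors: list[int] = []
    for start in range(0, len(page_summaries), shard_size):
        shard = list(range(start, min(start + shard_size, len(page_summaries))))
        picked = llm_select(question, [page_summaries[i] for i in shard], map_k)
        survivors.extend(shard[i] for i in picked)
    # Reduce: a single LLM call over all shard survivors.
    final = llm_select(question, [page_summaries[i] for i in survivors], reduce_k)
    return [survivors[i] for i in final]
```

In practice the map calls can run in parallel across shards, which is presumably what keeps the filter's latency bounded as the candidate pool grows.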
What carries the argument
The iterative refinement loop in which a vision-language model updates its query from retrieved page images and stores relevant evidence in a memory bank.
If this is right
- Medical question-answering systems gain accuracy when they retrieve and reason over the visual layout of document pages rather than text alone.
- Multiple rounds of query refinement allow a retrieval-augmented model to gather more complete evidence than a single retrieval step.
- A memory bank that persists evidence across rounds prevents loss of information and contributes additional performance gains.
- Coarse-to-fine indexing of page embeddings makes retrieval over hundreds of thousands of document pages practical at low latency.
Where Pith is reading between the lines
- The same iterative page-image approach could be tested on scientific or technical documents outside medicine where figures and tables carry critical information.
- Current text-only retrieval-augmented generation pipelines may systematically miss layout-dependent facts that become available once page images are used.
- Questions that explicitly require interpreting a figure or table would provide a sharper test of where the visual advantage is largest.
Load-bearing premise
The vision-language model accurately interprets visual content such as tables and figures on the page images without introducing new errors during iterative refinement or memory accumulation.
What would settle it
Running the identical benchmarks with a version that substitutes OCR text for the original page images and finding no accuracy difference would show that the visual content supplies no additive benefit.
Abstract
Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4×A100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.
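The coarse-to-fine index admits a compact sketch. In the version below, k-means stands in for the paper's unspecified centroid construction, brute-force search stands in for the ANN stage, and "two-way scoring" is read as symmetric MaxSim over patch embeddings; C=8 and the N1=2,000 Stage-1 cutoff come from the paper, while the shortlist size R is a hypothetical choice.

```python
# Minimal sketch of the coarse-to-fine page index, assuming each page has at
# least C patch embeddings. A real deployment would swap the brute-force
# centroid scan for an ANN library; this keeps the logic visible.
import numpy as np
from sklearn.cluster import KMeans

def build_page_centroids(page_patches: list[np.ndarray], C: int = 8) -> np.ndarray:
    """Offline: compress each page's patch embeddings (P x D) to C centroids."""
    cents = [KMeans(n_clusters=C, n_init=4).fit(p).cluster_centers_
             for p in page_patches]
    return np.stack(cents)                      # (N_pages, C, D)

def coarse_scores(query: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Proxy score: each query token matched to its best centroid per page."""
    sims = np.einsum("qd,ncd->nqc", query, centroids)   # (N_pages, Q, C)
    return sims.max(axis=2).sum(axis=1)                 # (N_pages,)

def maxsim(a: np.ndarray, b: np.ndarray) -> float:
    """Late-interaction score: every token of a meets its best match in b."""
    return float((a @ b.T).max(axis=1).sum())

def stage1_retrieve(query, centroids, page_patches, R=10_000, n1=2_000):
    """Coarse shortlist by centroid score, then exact two-way rescoring."""
    shortlist = np.argsort(-coarse_scores(query, centroids))[:R]
    rescored = sorted(
        shortlist,
        key=lambda i: -(maxsim(query, page_patches[i])
                        + maxsim(page_patches[i], query)),
    )
    return rescored[:n1]                        # N1 candidates for Stage 2
```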
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MED-VRAG, an iterative multimodal RAG framework for medical QA that retrieves PMC page images using ColQwen2.5 patch embeddings and a sharded MapReduce filter, then uses a VLM to iteratively refine queries and accumulate evidence in a memory bank over up to 3 rounds. It reports an average accuracy of 78.6% on MedQA, MedMCQA, PubMedQA, and MMLU-Med, with controlled ablations showing +5.8 from retrieval, +1.0 from page images vs text chunks, +1.5 from iteration, and +1.0 from the memory bank using the Qwen2.5-VL-32B backbone.
Significance. The work could meaningfully advance RAG systems for medicine by incorporating the visual elements of documents, which are often critical in biomedical literature. The efficiency of the retrieval system (under 30 ms for Stage 1, ~48 s for the full pipeline) and the ablation studies isolating the contributions of individual components are strengths. If the VLM's visual interpretation is reliable, the approach could improve performance on tasks requiring understanding of tables and figures.
major comments (3)
- [Ablation Studies] The +1.0 gain attributed to page-image retrieval over text-chunk retrieval (reported in the ablation results) relies on the VLM correctly interpreting visual content without introducing errors. However, there is no reported error analysis, hallucination rate on tables/figures, or direct comparison of VLM outputs on raw images versus OCR/text equivalents. This is a load-bearing issue for the central claim that multimodal retrieval provides additive benefits.
- [Experimental Results] The +1.8 edge over MedRAG + GPT-4 is noted as a cross-paper comparison; while the caveat is mentioned in the abstract, the lack of head-to-head evaluation with the same backbone and retrieval setup limits the strength of this claim.
- [Method] Details on how the iterative process (up to 3 rounds) and memory bank handle potential misinterpretations from the VLM on page images (e.g., table values or figure trends) are insufficient, raising concerns about error propagation or amplification across rounds.
minor comments (2)
- [Abstract] Clarify whether the 78.6% average is a simple or weighted average across the four benchmarks.
- [Implementation Details] The time costs (15.9 s per iteration, 47.8 s for the full pipeline) are reported as point estimates; adding variance and fuller hardware specifics beyond 4×A100 would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects for strengthening our claims on multimodal benefits, experimental comparisons, and robustness of the iterative process. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
Referee: The +1.0 gain attributed to page-image retrieval over text-chunk retrieval (reported in the ablation results) relies on the VLM correctly interpreting visual content without introducing errors. However, there is no reported error analysis, hallucination rate on tables/figures, or direct comparison of VLM outputs on raw images versus OCR/text equivalents. This is a load-bearing issue for the central claim that multimodal retrieval provides additive benefits.
Authors: We agree that quantifying VLM reliability on visual elements is important to validate the multimodal advantage. Our ablation uses the identical Qwen2.5-VL-32B backbone for both page-image and text-chunk conditions, providing a controlled comparison. In the revision, we will add a new subsection with qualitative examples of VLM reasoning on page images (tables, figures, layouts) versus OCR/text equivalents, plus a manual error analysis on a sample of 100 retrieved pages reporting hallucination rates and discrepancies. This directly addresses the load-bearing concern with evidence of visual interpretation quality. revision: yes
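One way the promised analysis could be tabulated, as a minimal sketch: the annotation schema below (per-claim manual labels under image and OCR conditions) is hypothetical, not the authors' protocol.

```python
# Minimal sketch of the promised error analysis. Each claim the VLM makes
# about a page is manually labeled as supported or hallucinated, under both
# the page-image and OCR-text conditions; rates are then broken down by
# condition and visual element type.
from dataclasses import dataclass

@dataclass
class Claim:
    page_id: str
    condition: str        # "image" or "ocr"
    element: str          # "table", "figure", or "layout"
    hallucinated: bool    # manual annotator label

def hallucination_rates(claims: list[Claim]) -> dict[tuple[str, str], float]:
    """Hallucination rate per (condition, element) pair."""
    counts: dict[tuple[str, str], list[int]] = {}
    for c in claims:
        total_bad = counts.setdefault((c.condition, c.element), [0, 0])
        total_bad[0] += 1
        total_bad[1] += int(c.hallucinated)
    return {key: bad / total for key, (total, bad) in counts.items()}
```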
Referee: The +1.8 edge over MedRAG + GPT-4 is noted as a cross-paper comparison; while the caveat is mentioned in the abstract, the lack of head-to-head evaluation with the same backbone and retrieval setup limits the strength of this claim.
Authors: We acknowledge the cross-paper nature of the comparison and that a same-backbone head-to-head would strengthen it. MedRAG was originally evaluated with text-only retrieval and different models, so a full replication with our multimodal pipeline and Qwen2.5-VL-32B would require substantial re-implementation that risks conflating variables. In the revision, we will expand the discussion to detail setup differences (retrieval modality, model capabilities) and more prominently qualify the +1.8 point edge as indicative rather than definitive. revision: partial
Referee: Details on how the iterative process (up to 3 rounds) and memory bank handle potential misinterpretations from the VLM on page images (e.g., table values or figure trends) are insufficient, raising concerns about error propagation or amplification across rounds.
Authors: We thank the referee for this observation. The memory bank accumulates evidence across rounds, with the VLM prompted to refine queries using all prior entries and add new evidence only when it provides distinct information; the final generation step includes instructions to resolve conflicts by cross-referencing. In the revision, we will include pseudocode for the iterative loop and memory update, plus per-round performance breakdowns and discussion of failure cases to show that iteration typically mitigates rather than amplifies errors. revision: yes
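A minimal sketch of the loop the authors describe, with hypothetical wrappers vlm_refine() and vlm_answer() in place of the actual Qwen2.5-VL-32B prompting, and retrieve_pages() standing in for Stages 1 and 2:

```python
# Minimal sketch of the iterative loop and memory update. vlm_refine(),
# vlm_answer(), and retrieve_pages() are hypothetical stand-ins for the
# paper's components, not its actual interfaces.
MAX_ROUNDS = 3  # hard cap from the paper

def iterative_rag(question, retrieve_pages, vlm_refine, vlm_answer):
    memory: list[dict] = []      # evidence persisted across rounds
    query = question
    for _ in range(MAX_ROUNDS):
        pages = retrieve_pages(query)
        # The VLM inspects the retrieved page images with all prior memory
        # entries in context, returning a refined query plus only evidence
        # that adds distinct information.
        query, new_evidence, done = vlm_refine(question, query, pages, memory)
        memory.extend(e for e in new_evidence if e not in memory)
        if done:                 # model signals evidence is sufficient
            break
    # Final generation cross-references memory entries to resolve conflicts.
    return vlm_answer(question, memory)
```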
Circularity Check
No circularity; empirical benchmark results with controlled ablations
full rationale
The paper proposes MED-VRAG, an iterative multimodal RAG system using page images from PMC documents, ColQwen2.5 embeddings, MapReduce filtering, and VLM-based query refinement with a memory bank. It reports 78.6% average accuracy across MedQA, MedMCQA, PubMedQA, and MMLU-Med, plus ablation gains (+5.8 retrieval, +1.0 page-image vs text-chunk, +1.5 iteration, +1.0 memory bank) under same-backbone controls. These are direct experimental measurements on public benchmarks, not mathematical derivations, fitted predictions, or self-referential definitions. No equations, first-principles claims, or load-bearing self-citations appear in the provided description; performance figures are obtained externally via standard evaluation protocols rather than reducing to the system's own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of reasoning rounds = 3
- centroids per page (C) = 8
axioms (1)
- domain assumption: Vision-language models can reliably extract and reason over structured visual content (tables, figures, layouts) in biomedical page images.
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., and Wang, B. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925, 2024.
- [3] Chen, Z., Hernández-Cano, A., Romanou, A., Bonnet, A., Matoba, K., Salvi, F., Pagliardini, M., Fan, S., Köpf, A., Mohtashami, A., Sallinen, A., Sakhaeirad, A., Swamy, V., Krawczuk, I., Bayazit, D., Marmet, A., Montariol, S., Hartley, M.-A., Jaggi, M., and Bosselut, A. MEDITRON-70B: Scaling medical pretraining for large language models. arXiv preprint...
- [4] Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., and Colombo, P. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024.
- [5] Nori, H., King, N., McKinney, S. M., Carignan, D., and Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- [6] Sohn, J., Park, Y., Yoon, C., Park, S., Hwang, H., Sung, M., Kim, H., and Kang, J. Rationale-guided retrieval augmented generation for medical question answering. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 12739–12753, 2025.
- [7] Xiong, G., Jin, Q., Wang, X., Zhang, M., Lu, Z., and Zhang, A. Improving retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedings of the Pacific Symposium, pp. 199–214, 2025.
- [8] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [9] Appendix A, Prompts and Output Schema. Stage-2 filter prompt (Qwen3-30B-A3B, sharded MapReduce; target k=25 in map shards, 100 in reduce): "You are a medical document retrieval expert. Given a medical question and candidate page summaries, select the {target k} most relevant pages. Question: {question}. Candidate page summaries: ..."
- [10] Appendix, Retrieval cutoffs. ColQwen2.5 for page embedding; Qwen2.5-VL-7B for offline page summaries (chosen over the 32B variant for indexing throughput; mean summary length ~120 tokens); Qwen3-30B-A3B for the Stage-2 filter; Qwen2.5-VL-32B-Instruct for iterative reasoning. N1 = 2,000 candidates from Stage 1, N2 = 100 pages after the Stage-2 LLM filter, max 3 iterations (hard cap, ...)