MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

2026 · cs.CV · arXiv 2604.09757

1 Pith paper cites this work. Polarity classification is still being indexed.



abstract

Medical vision-language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.
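The abstract describes latent steps only at a high level, so the following is a toy sketch of the general idea of "reusing hidden states as continuous latent steps", not the paper's implementation. It assumes a single tanh layer standing in for the decoder and a fixed vector standing in for the encoded image; all names (`toy_decoder_step`, `latent_reasoning`, `W_in`, `W_ctx`) are hypothetical and chosen for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def toy_decoder_step(W_in, W_ctx, x, context):
    """Hypothetical stand-in for one decoder step.

    Mixes the current input embedding with a static image context
    and returns a hidden state of the same dimension.
    """
    pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_ctx, context))]
    return [math.tanh(p) for p in pre]


def latent_reasoning(W_in, W_ctx, query_emb, image_ctx, num_latent_steps=4):
    """Run a short latent segment before any answer token is produced.

    Each hidden state is fed back as the next input embedding, so no
    token is sampled between steps (the assumption illustrated here).
    """
    x = query_emb
    trace = []
    for _ in range(num_latent_steps):
        h = toy_decoder_step(W_in, W_ctx, x, image_ctx)
        trace.append(h)
        x = h  # continuous latent step: hidden state becomes the next input
    return trace


def make_matrix(rng, dim):
    """Random square matrix with small weights (illustrative only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


rng = random.Random(0)
dim = 4
W_in = make_matrix(rng, dim)
W_ctx = make_matrix(rng, dim)
query_emb = [rng.uniform(-1, 1) for _ in range(dim)]
image_ctx = [rng.uniform(-1, 1) for _ in range(dim)]

trace = latent_reasoning(W_in, W_ctx, query_emb, image_ctx, num_latent_steps=3)
```

In a real model the final latent state would condition ordinary token decoding of the answer; the ROI-supervised fine-tuning and VLPO stages named in the abstract are not represented in this sketch.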

representative citing papers

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

VITAL adds visual-semantic dual supervision during training of medical MLLMs for latent reasoning, yielding SOTA results on 7 benchmarks with a new 61K multi-modality dataset while enabling post-hoc textual and visual explanations at zero inference overhead.

citing papers explorer

Showing 1 of 1 citing paper.

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

cs.CV · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
