Recognition: 2 theorem links
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3
The pith
MedLVR adds short latent visual reasoning segments to medical VQA models to keep subtle diagnostic image details active during answer generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedLVR introduces an explicit visual evidence state into autoregressive decoding by interleaving short latent reasoning segments formed from reused decoder hidden states. These segments enable iterative preservation and refinement of query-relevant visual evidence. The method is trained first with ROI-supervised fine-tuning to align latent states to clinically relevant image regions and then with Visual-Latent Policy Optimization under outcome-level rewards. Experiments on OmniMedVQA and five other medical VQA benchmarks show the approach lifting the average score of the Qwen2.5-VL-7B backbone from 48.3% to 53.4%.
What carries the argument
The latent visual reasoning segment: short continuous latent steps created by reusing decoder hidden states that carry and iteratively refine visual evidence across decoding steps.
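To make the mechanism concrete, the following is a minimal sketch, not the authors' code: a toy causal decoder that feeds its newest hidden state back as the next input embedding for a few continuous latent steps before answer decoding would resume. The projection layer, segment length, and toy architecture are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyLatentDecoder(nn.Module):
    """Toy causal decoder that interleaves a continuous latent segment."""

    def __init__(self, hidden: int = 64, k_latent: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.reuse_proj = nn.Linear(hidden, hidden)  # hidden state -> next input embedding
        self.k_latent = k_latent                     # assumed segment length

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        # embeds: (B, T, H), standing in for the fused image + question context
        for _ in range(self.k_latent):
            t = embeds.size(1)
            mask = nn.Transformer.generate_square_subsequent_mask(t)
            hidden = self.blocks(embeds, mask=mask)      # one causal pass
            latent = self.reuse_proj(hidden[:, -1:, :])  # reuse the newest hidden state
            embeds = torch.cat([embeds, latent], dim=1)  # append one continuous latent step
        return embeds  # answer tokens would be decoded from this extended context

ctx = torch.randn(2, 10, 64)          # fake fused image+question embeddings
print(ToyLatentDecoder()(ctx).shape)  # torch.Size([2, 14, 64]) after 4 latent steps
```

Reusing hidden states rather than sampling discrete tokens keeps the intermediate steps continuous, which is presumably what lets gradients from ROI supervision and VLPO flow through the segment.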
If this is right
- Medical VQA models can maintain diagnostically relevant visual information throughout text generation instead of discarding it after the initial image encoding.
- ROI-supervised fine-tuning aligns the reused hidden states with clinically meaningful image regions.
- Visual-Latent Policy Optimization jointly improves the quality of the latent reasoning and the final generated answers under outcome rewards.
- The same gains appear consistently across OmniMedVQA and five additional external medical VQA datasets.
Where Pith is reading between the lines
- The same reuse of hidden states for latent visual steps could be tested on non-medical vision-language tasks that require tracking fine visual details over long outputs.
- Varying the number or duration of latent segments might reveal an optimal balance between visual preservation and computational cost.
- Combining this mechanism with other forms of visual supervision could address additional sources of error in diagnostic image interpretation.
Load-bearing premise
Reusing decoder hidden states as short continuous latent reasoning segments will reliably preserve and refine query-relevant visual evidence rather than adding noise or redundant computation.
What would settle it
An ablation that removes the latent reasoning segments entirely and measures no drop (or even a gain) in accuracy on the same medical VQA benchmarks would show the added visual state is not responsible for the reported gains.
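A minimal harness for that decisive ablation could look like the sketch below; `model`, `benchmark`, and the `k_latent` switch are hypothetical stand-ins for the paper's (unreleased) pipeline, not its actual API.

```python
# Hedged sketch of the ablation: run the identical model with latent
# segments enabled (k_latent > 0) and disabled (k_latent = 0) on the
# same questions, then compare accuracy. `model.answer` is hypothetical.
def evaluate(model, dataset, k_latent: int) -> float:
    correct = sum(model.answer(q, k_latent=k_latent) == a for q, a in dataset)
    return correct / len(dataset)

# acc_with    = evaluate(model, benchmark, k_latent=8)
# acc_without = evaluate(model, benchmark, k_latent=0)
# If acc_without matches or exceeds acc_with, the latent segments
# are not responsible for the reported gains.
```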
Original abstract
Medical vision-language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MedLVR, a latent visual reasoning framework for medical VQA that interleaves short continuous latent reasoning segments (reused decoder hidden states) into autoregressive decoding to iteratively preserve and refine query-relevant visual evidence, rather than relying on static image embeddings and text-centric reasoning. It employs a two-stage training process consisting of ROI-supervised fine-tuning to align latent states with clinically relevant regions followed by Visual-Latent Policy Optimization (VLPO) using outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks report consistent gains, raising average performance from 48.3% to 53.4% over the Qwen2.5-VL-7B backbone.
Significance. If the central mechanism is verified to actively preserve diagnostically relevant visual information rather than arising from supervision or added capacity alone, the approach could meaningfully improve reliability in clinical VQA where subtle localized evidence is critical. The reported gains are modest but consistent across benchmarks; however, the significance hinges on demonstrating that the latent states function as visual evidence carriers, which is not yet established by the provided details.
major comments (3)
- [Methods] Latent reasoning segment definition: Reusing decoder hidden states as short continuous latent steps is presented as enabling iterative visual evidence preservation, yet the description does not isolate this mechanism from the ROI supervision signal or the extra forward-pass capacity; without ablations that replace the segments with non-visual tokens or disable ROI alignment while keeping the parameter count fixed, the gains cannot be attributed to the claimed visual mechanism.
- [Experiments] Results and ablations: The 48.3% to 53.4% improvement is reported without error bars, statistical significance tests, or ablations on latent segment length, ROI supervision weight, or VLPO reward coefficients; this leaves open whether the gains stem from the two-stage training procedure itself rather than from the latent visual reasoning component.
- [Experiments] No probing analysis: The manuscript contains no attention-map visualizations, feature reconstruction experiments, or comparisons of latent states against purely textual hidden states to verify that the reused decoder states carry and refine query-relevant visual information rather than generic computation.
minor comments (2)
- [Abstract / Methods] The abstract and methods would benefit from an explicit equation or diagram defining how the latent reasoning segment is inserted into the decoder hidden-state sequence and how it interacts with the visual encoder output.
- [Experiments] Table or figure captions for benchmark results should include the exact number of test samples per dataset and the precise backbone configuration used for the 48.3% baseline.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the evidence for the latent visual reasoning mechanism.
Point-by-point responses
-
Referee: [Methods] Latent reasoning segment definition: Reusing decoder hidden states as short continuous latent steps is presented as enabling iterative visual evidence preservation, yet the description does not isolate this mechanism from the ROI supervision signal or the extra forward-pass capacity; without ablations that replace the segments with non-visual tokens or disable ROI alignment while keeping the parameter count fixed, the gains cannot be attributed to the claimed visual mechanism.
Authors: We agree that explicit isolation ablations are needed to attribute gains specifically to the visual evidence preservation in latent states. In the revised manuscript, we will add controlled experiments that (1) replace latent reasoning segments with non-visual tokens (e.g., zero or random embeddings) while preserving architecture and training, and (2) disable ROI alignment during the first training stage while matching parameter counts and forward-pass capacity. These will directly test whether the iterative refinement arises from the claimed mechanism rather than from supervision or added capacity. Revision: yes.
-
Referee: [Experiments] Results and ablations: The 48.3% to 53.4% improvement is reported without error bars, statistical significance tests, or ablations on latent segment length, ROI supervision weight, or VLPO reward coefficients; this leaves open whether the gains stem from the two-stage training procedure itself rather than from the latent visual reasoning component.
Authors: The referee is correct that additional statistical rigor and hyperparameter ablations would strengthen the claims. We will revise the experiments section to include error bars (standard deviation across multiple runs), paired statistical significance tests on the reported improvements, and targeted ablations varying latent segment length, ROI supervision weight, and VLPO reward coefficients. These additions will help isolate the contribution of the latent visual reasoning component from the two-stage training procedure as a whole; a sketch of such a paired test appears after this response list. Revision: yes.
-
Referee: [Experiments] No probing analysis: The manuscript contains no attention-map visualizations, feature reconstruction experiments, or comparisons of latent states against purely textual hidden states to verify that the reused decoder states carry and refine query-relevant visual information rather than generic computation.
Authors: We acknowledge that direct probing would provide stronger verification that the reused decoder states function as visual evidence carriers. In the revised manuscript, we will add (1) attention-map visualizations contrasting MedLVR with the baseline, (2) feature reconstruction experiments measuring how well latent states recover query-relevant image regions, and (3) quantitative comparisons of latent states versus purely textual hidden states (e.g., via cosine similarity to visual features and query relevance metrics). These analyses will directly address whether the states preserve and refine visual information; a toy version of the similarity comparison also appears after this list. Revision: yes.
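For the promised significance testing, one standard choice for paired per-question correctness is an exact McNemar test. The sketch below uses placeholder outcome arrays, not the paper's data.

```python
# Exact McNemar test on paired per-question correctness (placeholder data).
# Under H0 the b + c discordant pairs split as Binomial(b + c, 0.5).
import numpy as np
from scipy.stats import binomtest

baseline = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # baseline correct per question
medlvr   = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])  # MedLVR correct per question

b = int(np.sum((baseline == 1) & (medlvr == 0)))  # only the baseline correct
c = int(np.sum((baseline == 0) & (medlvr == 1)))  # only MedLVR correct
result = binomtest(min(b, c), n=b + c, p=0.5, alternative="two-sided")
print(f"discordant pairs b={b}, c={c}, p = {result.pvalue:.4f}")
```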
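And for the probing comparison, a toy version of the latent-vs-text similarity check could look like this; random tensors stand in for real ROI features and hidden states.

```python
import torch
import torch.nn.functional as F

roi_feats     = torch.randn(32, 64)   # pooled ROI features from a vision encoder
latent_states = torch.randn(32, 64)   # hidden states taken from latent segments
text_states   = torch.randn(32, 64)   # hidden states at ordinary text positions

latent_sim = F.cosine_similarity(latent_states, roi_feats, dim=-1).mean()
text_sim   = F.cosine_similarity(text_states, roi_feats, dim=-1).mean()
print(f"latent-vs-ROI: {latent_sim:.3f} | text-vs-ROI: {text_sim:.3f}")
# If the mechanism works as claimed, latent_sim should exceed text_sim
# on real data (with random tensors both hover near zero).
```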
Circularity Check
No circularity: empirical gains on held-out benchmarks after independent supervision
Full rationale
The paper's derivation chain consists of a proposed architecture (reusing decoder hidden states as latent segments) trained in two stages with external ROI labels and outcome rewards, followed by evaluation on held-out benchmarks (OmniMedVQA and five others). No equations, self-citations, or ansatzes define the claimed preservation of visual evidence in terms of the same fitted quantities, and nothing reduces the reported 48.3% to 53.4% improvement to a definitional equivalence. The central claim rests on external performance metrics rather than on construction from the training signals alone, so the argument is not circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- latent reasoning segment length
- ROI supervision weight and VLPO reward coefficients
axioms (2)
- domain assumption: Hidden states from the vision-language decoder can serve as continuous, refinable representations of query-relevant visual evidence
- domain assumption: Outcome-level rewards in VLPO will improve both latent reasoning quality and final answer accuracy
invented entities (1)
- latent reasoning segment / visual evidence state (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Breath1024.lean · Breath1024's 8-tick periodic micro-structure
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"set the latent rollout budget to K=8 under the step-based decoding strategy... varying the latent size from 2 to 16... performance peak typically appearing at 4 or 8"
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates / z_monotone_absolute
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.