Recognition: 2 theorem links
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3
The pith
MedLVR adds short latent visual reasoning segments to medical VQA models to keep subtle diagnostic image details active during answer generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedLVR introduces an explicit visual evidence state into autoregressive decoding by interleaving short latent reasoning segments formed from reused decoder hidden states. These segments enable iterative preservation and refinement of query-relevant visual evidence. The method is trained first with ROI-supervised fine-tuning to align latent states to clinically relevant image regions and then with Visual-Latent Policy Optimization under outcome-level rewards. Experiments on OmniMedVQA and five other medical VQA benchmarks show the approach lifting the average score of the Qwen2.5-VL-7B backbone from 48.3% to 53.4%.
What carries the argument
The latent visual reasoning segment: short continuous latent steps created by reusing decoder hidden states that carry and iteratively refine visual evidence across decoding steps.
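To make the mechanism concrete, the following is a minimal sketch, not the authors' code: a toy causal decoder that feeds its newest hidden state back as the next input embedding for a few continuous latent steps before answer decoding would resume. The projection layer, segment length, and toy architecture are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyLatentDecoder(nn.Module):
    """Toy causal decoder that interleaves a continuous latent segment."""

    def __init__(self, hidden: int = 64, k_latent: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.reuse_proj = nn.Linear(hidden, hidden)  # hidden state -> next input embedding
        self.k_latent = k_latent                     # assumed segment length

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:
        # embeds: (B, T, H), standing in for the fused image + question context
        for _ in range(self.k_latent):
            t = embeds.size(1)
            mask = nn.Transformer.generate_square_subsequent_mask(t)
            hidden = self.blocks(embeds, mask=mask)      # one causal pass
            latent = self.reuse_proj(hidden[:, -1:, :])  # reuse the newest hidden state
            embeds = torch.cat([embeds, latent], dim=1)  # append one continuous latent step
        return embeds  # answer tokens would be decoded from this extended context

ctx = torch.randn(2, 10, 64)          # fake fused image+question embeddings
print(ToyLatentDecoder()(ctx).shape)  # torch.Size([2, 14, 64]) after 4 latent steps
```

Reusing hidden states rather than sampling discrete tokens keeps the intermediate steps continuous, which is presumably what lets gradients from ROI supervision and VLPO flow through the segment.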
If this is right
- Medical VQA models can maintain diagnostically relevant visual information throughout text generation instead of discarding it after the initial image encoding.
- ROI-supervised fine-tuning aligns the reused hidden states with clinically meaningful image regions.
- Visual-Latent Policy Optimization jointly improves the quality of the latent reasoning and the final generated answers under outcome rewards.
- The same gains appear consistently across OmniMedVQA and five additional external medical VQA datasets.
Where Pith is reading between the lines
- The same reuse of hidden states for latent visual steps could be tested on non-medical vision-language tasks that require tracking fine visual details over long outputs.
- Varying the number or duration of latent segments might reveal an optimal balance between visual preservation and computational cost.
- Combining this mechanism with other forms of visual supervision could address additional sources of error in diagnostic image interpretation.
Load-bearing premise
Reusing decoder hidden states as short continuous latent reasoning segments will reliably preserve and refine query-relevant visual evidence rather than adding noise or redundant computation.
What would settle it
An ablation that removes the latent reasoning segments entirely and measures no drop (or even a gain) in accuracy on the same medical VQA benchmarks would show the added visual state is not responsible for the reported gains.
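A minimal harness for that decisive ablation could look like the sketch below; `model`, `benchmark`, and the `k_latent` switch are hypothetical stand-ins for the paper's (unreleased) pipeline, not its actual API.

```python
# Hedged sketch of the ablation: run the identical model with latent
# segments enabled (k_latent > 0) and disabled (k_latent = 0) on the
# same questions, then compare accuracy. `model.answer` is hypothetical.
def evaluate(model, dataset, k_latent: int) -> float:
    correct = sum(model.answer(q, k_latent=k_latent) == a for q, a in dataset)
    return correct / len(dataset)

# acc_with    = evaluate(model, benchmark, k_latent=8)
# acc_without = evaluate(model, benchmark, k_latent=0)
# If acc_without matches or exceeds acc_with, the latent segments
# are not responsible for the reported gains.
```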
Original abstract
Medical vision-language models (VLMs) have shown strong potential for medical visual question answering (VQA), yet their reasoning remains largely text-centric: images are encoded once as static context, and subsequent inference is dominated by language. This paradigm is fundamentally limited in clinical scenarios, where accurate answers often depend on subtle, localized visual evidence that cannot be reliably preserved in static embeddings. We propose MedLVR, a latent visual reasoning framework that introduces an explicit visual evidence state into autoregressive decoding. Instead of relying solely on text-based intermediate reasoning, MedLVR interleaves a short latent reasoning segment within the decoder by reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence before answer generation. To support effective visual supervision, we adopt a two-stage training strategy: region of interest (ROI)-supervised fine-tuning aligns latent states with clinically relevant image evidence, and Visual-Latent Policy Optimization (VLPO) further optimizes latent reasoning and answer generation under outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks show that MedLVR consistently outperforms recent reasoning baselines and improves the average score over the Qwen2.5-VL-7B backbone from 48.3% to 53.4%. These results show that latent visual reasoning provides an effective mechanism for preserving diagnostically relevant visual evidence and improving the reliability of medical VQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MedLVR, a latent visual reasoning framework for medical VQA that interleaves short continuous latent reasoning segments (reused decoder hidden states) into autoregressive decoding to iteratively preserve and refine query-relevant visual evidence, rather than relying on static image embeddings and text-centric reasoning. It employs a two-stage training process consisting of ROI-supervised fine-tuning to align latent states with clinically relevant regions followed by Visual-Latent Policy Optimization (VLPO) using outcome-level rewards. Experiments on OmniMedVQA and five external medical VQA benchmarks report consistent gains, raising average performance from 48.3% to 53.4% over the Qwen2.5-VL-7B backbone.
Significance. If the central mechanism is verified to actively preserve diagnostically relevant visual information rather than arising from supervision or added capacity alone, the approach could meaningfully improve reliability in clinical VQA where subtle localized evidence is critical. The reported gains are modest but consistent across benchmarks; however, the significance hinges on demonstrating that the latent states function as visual evidence carriers, which is not yet established by the provided details.
major comments (3)
- [Methods] Latent reasoning segment definition: Reusing decoder hidden states as short continuous latent steps is presented as enabling iterative visual evidence preservation, yet the description does not isolate this mechanism from the ROI supervision signal or the extra forward-pass capacity; without ablations that replace the segments with non-visual tokens or disable ROI alignment while keeping the parameter count fixed, the gains cannot be attributed to the claimed visual mechanism.
- [Experiments] Results and ablations: The 48.3% to 53.4% improvement is reported without error bars, statistical significance tests, or ablations on latent segment length, ROI supervision weight, or VLPO reward coefficients; this leaves open whether the gains stem from the two-stage training procedure itself rather than from the latent visual reasoning component.
- [Experiments] No probing analysis: The manuscript contains no attention-map visualizations, feature reconstruction experiments, or comparisons of latent states against purely textual hidden states to verify that the reused decoder states carry and refine query-relevant visual information rather than generic computation.
minor comments (2)
- [Abstract / Methods] The abstract and methods would benefit from an explicit equation or diagram defining how the latent reasoning segment is inserted into the decoder hidden-state sequence and how it interacts with the visual encoder output.
- [Experiments] Table or figure captions for benchmark results should include the exact number of test samples per dataset and the precise backbone configuration used for the 48.3% baseline.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the evidence for the latent visual reasoning mechanism.
Point-by-point responses
-
Referee: [Methods] Latent reasoning segment definition: Reusing decoder hidden states as short continuous latent steps is presented as enabling iterative visual evidence preservation, yet the description does not isolate this mechanism from the ROI supervision signal or the extra forward-pass capacity; without ablations that replace the segments with non-visual tokens or disable ROI alignment while keeping the parameter count fixed, the gains cannot be attributed to the claimed visual mechanism.
Authors: We agree that explicit isolation ablations are needed to attribute gains specifically to the visual evidence preservation in latent states. In the revised manuscript, we will add controlled experiments that (1) replace latent reasoning segments with non-visual tokens (e.g., zero or random embeddings) while preserving architecture and training, and (2) disable ROI alignment during the first training stage while matching parameter counts and forward-pass capacity. These will directly test whether the iterative refinement arises from the claimed mechanism rather than from supervision or added capacity. Revision: yes.
-
Referee: [Experiments] Results and ablations: The 48.3% to 53.4% improvement is reported without error bars, statistical significance tests, or ablations on latent segment length, ROI supervision weight, or VLPO reward coefficients; this leaves open whether the gains stem from the two-stage training procedure itself rather than from the latent visual reasoning component.
Authors: The referee is correct that additional statistical rigor and hyperparameter ablations would strengthen the claims. We will revise the experiments section to include error bars (standard deviation across multiple runs), paired statistical significance tests on the reported improvements, and targeted ablations varying latent segment length, ROI supervision weight, and VLPO reward coefficients. These additions will help isolate the contribution of the latent visual reasoning component from the two-stage training procedure as a whole; a sketch of such a paired test appears after this response list. Revision: yes.
-
Referee: [Experiments] No probing analysis: The manuscript contains no attention-map visualizations, feature reconstruction experiments, or comparisons of latent states against purely textual hidden states to verify that the reused decoder states carry and refine query-relevant visual information rather than generic computation.
Authors: We acknowledge that direct probing would provide stronger verification that the reused decoder states function as visual evidence carriers. In the revised manuscript, we will add (1) attention-map visualizations contrasting MedLVR with the baseline, (2) feature reconstruction experiments measuring how well latent states recover query-relevant image regions, and (3) quantitative comparisons of latent states versus purely textual hidden states (e.g., via cosine similarity to visual features and query relevance metrics). These analyses will directly address whether the states preserve and refine visual information; a toy version of the similarity comparison also appears after this list. Revision: yes.
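For the promised significance testing, one standard choice for paired per-question correctness is an exact McNemar test. The sketch below uses placeholder outcome arrays, not the paper's data.

```python
# Exact McNemar test on paired per-question correctness (placeholder data).
# Under H0 the b + c discordant pairs split as Binomial(b + c, 0.5).
import numpy as np
from scipy.stats import binomtest

baseline = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # baseline correct per question
medlvr   = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])  # MedLVR correct per question

b = int(np.sum((baseline == 1) & (medlvr == 0)))  # only the baseline correct
c = int(np.sum((baseline == 0) & (medlvr == 1)))  # only MedLVR correct
result = binomtest(min(b, c), n=b + c, p=0.5, alternative="two-sided")
print(f"discordant pairs b={b}, c={c}, p = {result.pvalue:.4f}")
```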
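And for the probing comparison, a toy version of the latent-vs-text similarity check could look like this; random tensors stand in for real ROI features and hidden states.

```python
import torch
import torch.nn.functional as F

roi_feats     = torch.randn(32, 64)   # pooled ROI features from a vision encoder
latent_states = torch.randn(32, 64)   # hidden states taken from latent segments
text_states   = torch.randn(32, 64)   # hidden states at ordinary text positions

latent_sim = F.cosine_similarity(latent_states, roi_feats, dim=-1).mean()
text_sim   = F.cosine_similarity(text_states, roi_feats, dim=-1).mean()
print(f"latent-vs-ROI: {latent_sim:.3f} | text-vs-ROI: {text_sim:.3f}")
# If the mechanism works as claimed, latent_sim should exceed text_sim
# on real data (with random tensors both hover near zero).
```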
Circularity Check
No circularity: empirical gains on held-out benchmarks after independent supervision
Full rationale
The paper's derivation chain consists of a proposed architecture (reusing decoder hidden states as latent segments) trained in two stages with external ROI labels and outcome rewards, followed by evaluation on held-out benchmarks (OmniMedVQA and five others). No equations, self-citations, or ansatzes define the claimed preservation of visual evidence in terms of the same fitted quantities, and nothing reduces the reported 48.3% to 53.4% improvement to a definitional equivalence. The central claim rests on external performance metrics rather than on construction from the training signals alone, so the argument is not circular.
Axiom & Free-Parameter Ledger
free parameters (2)
- latent reasoning segment length
- ROI supervision weight and VLPO reward coefficients
axioms (2)
- domain assumption: Hidden states from the vision-language decoder can serve as continuous, refinable representations of query-relevant visual evidence
- domain assumption: Outcome-level rewards in VLPO will improve both latent reasoning quality and final answer accuracy
invented entities (1)
- latent reasoning segment / visual evidence state (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Breath1024.lean · Breath1024's 8-tick periodic micro-structure
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"set the latent rollout budget to K=8 under the step-based decoding strategy... varying the latent size from 2 to 16... performance peak typically appearing at 4 or 8"
-
IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates / z_monotone_absolute
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
"reusing hidden states as continuous latent steps, enabling iterative preservation and refinement of query-relevant visual evidence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.