pith. machine review for the scientific record.

arxiv: 2604.26283 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.AI

Recognition: unknown

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical vision-language models · latent memory evolution · clinical intuition · diagnostic accuracy · reinforcement learning · causal refinement · memory internalization

The pith

Medical vision-language models internalize clinical intuition by evolving latent diagnostic memories in their hidden states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a cognitive misalignment in medical vision-language models stemming from discrete tokenization, which causes quantization loss, dissipation of long-range information, and the absence of the case-adaptive expertise that clinicians invoke instinctively. It develops a process for dynamically synthesizing and refining implicit diagnostic memories within the model's hidden stream to simulate expert experiential recall during image interpretation. The method retrieves structured priors, applies causal analysis through reinforcement learning on masked regions, and aligns patterns internally to embed diagnostic logic directly into parameters. A sympathetic reader would care because this promises AI systems that diagnose more like experienced clinicians rather than relying on explicit step-by-step reasoning chains.

Core claim

The paper claims that medical vision-language models suffer from quantization loss and missing adaptive expertise due to discrete tokenization. By implementing a latent diagnostic memory evolution process that begins with generating condensed implicit memories from anatomical priors, refines them through counterfactual rewards based on feature masking to assess causality, and transitions them via divergence alignment in a dual-branch setup, external diagnostic expertise becomes internalized. This results in significantly higher diagnostic accuracy across datasets compared to existing state-of-the-art methods, especially those relying on chain-of-thought.
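The final-stage alignment the claim leans on is an ordinary divergence objective. As a hedged illustration (the 5-token vocabulary and both distributions are invented here, not taken from the paper), the Jensen–Shannon divergence between teacher-branch and student-branch next-token distributions looks like:

```python
# Toy sketch of full-vocabulary Jensen-Shannon alignment between a
# privileged (teacher) and autonomous (student) branch. All numbers
# are illustrative placeholders, not the paper's.
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence between two distributions over the vocabulary."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

teacher = np.array([0.70, 0.10, 0.10, 0.05, 0.05])  # privileged branch, sees M_pri
student = np.array([0.40, 0.25, 0.15, 0.10, 0.10])  # autonomous branch, sees M_auto

# the distillation loss pushes the student toward the teacher's distribution
loss = js_divergence(teacher, student)
```

JSD is symmetric and bounded by ln 2, which is one practical reason it is often preferred over plain KL for two-branch distillation of this kind.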

What carries the argument

Latent diagnostic memory evolution, a process that dynamically synthesizes implicit diagnostic memories in the model's hidden stream to bridge visual features with clinical logic.
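A minimal sketch of what "synthesizing implicit memories in the hidden stream" plausibly amounts to, assuming a Q-Former-style condensation step: learnable probes cross-attend over encoder feature tokens and pool them into a handful of memory vectors. The shapes, the single attention layer, and the random inputs are illustrative, not the paper's implementation (the review's Figure 4 does note performance peaking near N=16 probes).

```python
# Hypothetical condensation step: N learnable meta-query probes attend
# over flattened encoder features and emit N compact implicit-memory
# vectors that would be injected into the VLM hidden stream.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d = 196, 32   # flattened H_f x W_f encoder feature tokens (assumed dims)
n_probes = 16             # diagnostic probe count; N=16 is the sweet spot reported

features = rng.standard_normal((num_tokens, d))   # F: anatomical encoder features
queries = rng.standard_normal((n_probes, d))      # Q0: learnable meta-query probes

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# one cross-attention layer: each probe forms a weighted pool of feature tokens
attn = softmax(queries @ features.T / np.sqrt(d))  # (n_probes, num_tokens)
memory = attn @ features                           # M: (n_probes, d) implicit memory
```

The point of the design is compression: 196 visual tokens collapse into 16 memory slots, which is what makes later per-slot causal attribution tractable.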

If this is right

  • Diagnostic accuracy rises by relying on evolved internal memories instead of explicit chain-of-thought prompting.
  • Non-causal or redundant memories are pruned based on reinforcement learning rewards from masked regions.
  • Latent representations align more closely with actual diagnostic decision logic.
  • External expertise transfers into the model's endogenous parameters for improved case adaptation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar memory evolution could apply to other domains needing rapid expert judgment, such as scientific image analysis.
  • The method may improve handling of ambiguous or low-quality inputs by relying on internalized patterns rather than surface features.
  • End-to-end internalization of intuition might prove more efficient than post-hoc prompting in broader AI systems.

Load-bearing premise

Reinforcement learning guided by region-level feature masking and vocabulary alignment can accurately quantify causal contributions of memories and internalize clinical intuition without introducing artifacts or biases.
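The premise can be made concrete with a toy counterfactual-reward loop: a memory slot's causal contribution is estimated as the drop in a diagnostic score when its associated image region is masked out, and near-zero contributions get pruned. The scoring function, the linear weights, and the pruning threshold below are placeholders, not the paper's reward design.

```python
# Illustrative counterfactual attribution: mask region i, re-score, and
# treat the score drop as slot i's causal contribution. Everything here
# is a synthetic stand-in for the paper's CCR reward.
import numpy as np

def diagnostic_score(active: np.ndarray, weights: np.ndarray) -> float:
    """Toy stand-in for model confidence in the correct diagnosis."""
    return float(active @ weights)

weights = np.array([0.6, 0.3, 0.05, 0.05])  # slots 0-1 causal, 2-3 near-redundant
full = np.ones(4)                            # all regions/memories present

contributions = []
for i in range(len(full)):
    masked = full.copy()
    masked[i] = 0.0  # counterfactual: remove region i's features
    r_causal = diagnostic_score(full, weights) - diagnostic_score(masked, weights)
    contributions.append(r_causal)

keep = [i for i, c in enumerate(contributions) if c > 0.1]  # prune redundant slots
```

The referee's objection maps directly onto this sketch: if regions are correlated, masking one at a time misattributes shared signal, so the per-slot drops no longer sum to clean individual causal effects.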

What would settle it

Evaluating the model on a dataset of medical images with introduced confounding visual features and checking whether disabling the memory evolution components causes a clear drop in accuracy relative to the full method.
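In code, the settling test reduces to a paired accuracy comparison on the same confounded cases, with and without the memory-evolution components. Labels and predictions below are synthetic placeholders, not results from the paper.

```python
# Sketch of the proposed ablation check: does disabling memory evolution
# cause a clear accuracy drop on confounded cases? Data is synthetic.
import numpy as np

labels = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # ground-truth diagnoses
pred_full = np.array([1, 0, 1, 1, 0, 1, 0, 0])     # full method
pred_ablated = np.array([1, 0, 0, 1, 1, 0, 0, 1])  # memory evolution disabled

acc = lambda p: float((p == labels).mean())
drop = acc(pred_full) - acc(pred_ablated)
# a clear positive drop would support the claim that the memories are load-bearing
```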

Figures

Figures reproduced from arXiv: 2604.26283 by Chunzheng Zhu, Jianxin Lin, Jiaqi Zeng, Junyu Jiang, Yijun Wang.

Figure 1
Figure 1. Existing medical VLMs suffer from coarse symbolic granularity and long-range information dissipation in discrete reasoning. MedSynapse-V addresses this by evolving diagnostic implicit memory in latent space via anatomical prior condensation, causal counterfactual refinement, and autonomous latent memory internalization.
Figure 2
Figure 2. Stages I and II of MedSynapse-V. The hook features from an encoder are condensed into diagnostic implicit memory via learnable meta-query probes and injected into the VLM hidden stream. The memory is then refined through RL with composite rewards, ensuring causal alignment between memory and clinical decision logic. Chain of diagnostic memory: F_ana → (Meta Query) M → (CCR) M⋆ → (IMT) M_auto.
Figure 3
Figure 3. Intrinsic Memory Transition (IMT) is achieved via Jensen–Shannon divergence alignment between the teacher (π+, conditioned on encoder-derived M_pri) and student (π−, conditioned on M_auto) branches. Gradients propagate solely to Aψ, enabling complete removal of the anatomical encoder at inference with negligible overhead.
Figure 4
Figure 4. Effect of diagnostic probe count N. Performance peaks around N=16 across benchmarks; further increasing N dilutes diagnostically relevant signals.
Figure 5
Figure 5. Qualitative comparison across CT, MRI, and Ultrasound cases. MedSynapse-V produces concise, correct diagnoses, while Med-R1 and MMedExpert-R1 generate verbose CoT with hallucinated findings (red) leading to misdiagnoses.
Figure 6
Figure 6. Accuracy–latency trade-off across compared VLM categories.
Figure 7
Figure 7. The RL training reward dynamics with and without r_causal.
Figure 8
Figure 8. t-SNE visualization of implicit memory M_auto after CCR. (a) Eight imaging modalities form well-separated clusters with clinically coherent proximity. (b, c) Within CT and Pathology, disease subtypes further segregate into distinct regions.
Figure 9
Figure 9. Detailed architecture of the Diagnostic Memory Sampler Pϕ. The frozen anatomical encoder E_ana extracts spatial features F ∈ R^(H_f × W_f × d_f), which are flattened into a token sequence and used as key–value pairs for the learnable meta-query probes Q0. Through L layers of self-attention, feed-forward processing, cross-attention, and a final linear projection (d_f → d_h), the module produces N compact implicit memories.
Figure 10
Figure 10. Training dynamics across three stages: (a–c) Stage II reward optimization and gradient stabilization via causal refinement; (d) Stage I NTP loss convergence; (e) Stage II policy-KL evolution; (f) Stage III distillation fidelity and output agreement.
Figure 11
Figure 11. Causal intervention visualization on fundus (left group) and dermoscopy (right group). Each group: original image, MedSAM3 region mask B, and post-CCR memory attention map. After refinement, memory attention concentrates on diagnostically critical structures while suppressing background.
Figure 12
Figure 12. Memory evolution across training stages.
Figure 13
Figure 13. Qualitative comparison across Chest X-ray, Pathology, and Head CT cases. MedSynapse-V produces concise, correct diagnoses (∼38–43 tokens), while other methods generate verbose CoT (∼195–215 tokens) with hallucinated findings (red).
Figure 14
Figure 14. Three representative challenging modes.
Figure 15
Figure 15. Prompt template for closed-ended multi-choice VQA (VQA-RAD, SLAKE, PathVQA, PMC-VQA, MMMU*, MedXpertQA-MM, GMAI-MMBench). The number of options varies by dataset (2–5); the template adapts accordingly. System: You are a helpful medical assistant. Provide a concise answer to the question. User: <image> {question} Answer the question using a single word or phrase.
Figure 16
Figure 16. Prompt template. Notably, M_auto is autonomously generated and injected in the hidden stream without altering the text prompt.
read the original abstract

High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly invoke during image interpretation. We pinpoint a fundamental cognitive misalignment in medical VLMs caused by discrete tokenization, leading to quantization loss, long-range information dissipation, and missing case-adaptive expertise. To bridge this gap, we propose ours, a framework for latent diagnostic memory evolution that simulates the experiential invocation of clinicians by dynamically synthesizing implicit diagnostic memories within the model's hidden stream. Specifically, it begins with a Meta Query for Prior Memorization mechanism, where learnable probes retrieve structured priors from an anatomical prior encoder to generate condensed implicit memories. To ensure clinical fidelity, we introduce Causal Counterfactual Refinement (CCR), which leverages reinforcement learning and counterfactual rewards derived from region-level feature masking to quantify the causal contribution of each memory, thereby pruning redundancies and aligning latent representations with diagnostic logic. This evolutionary process culminates in Intrinsic Memory Transition (IMT), a privileged-autonomous dual-branch paradigm that internalizes teacher-branch diagnostic patterns into the student-branch via full-vocabulary divergence alignment. Comprehensive empirical evaluations across multiple datasets demonstrate that ours, by transferring external expertise into endogenous parameters, significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MedSynapse-V, a framework for bridging visual perception and clinical intuition in medical vision-language models through latent memory evolution. It identifies issues with discrete tokenization in VLMs and introduces three mechanisms: Meta Query for Prior Memorization using learnable probes to generate condensed implicit memories from anatomical priors; Causal Counterfactual Refinement (CCR) that uses reinforcement learning and counterfactual rewards from region-level feature masking to quantify causal contributions, prune redundancies, and align with diagnostic logic; and Intrinsic Memory Transition (IMT), a dual-branch paradigm to internalize teacher patterns into the student via full-vocabulary divergence alignment. The paper claims that this approach significantly outperforms existing state-of-the-art methods, particularly chain-of-thought paradigms, in diagnostic accuracy across multiple datasets by transferring external expertise into endogenous parameters.

Significance. If the empirical results hold and the mechanisms validly capture and internalize clinical intuition without confounding biases, this could be a significant contribution to medical AI by addressing the gap between static image features and dynamic expert memory invocation. The integration of RL for causal refinement and privileged autonomous learning in IMT offers a novel approach to model experiential knowledge, potentially improving diagnostic VLMs if the assumptions about isolation of causal effects are substantiated.

major comments (2)
  1. Abstract and CCR mechanism description: The assertion that CCR leverages RL with region-level feature masking to accurately quantify causal contributions of memories is central to the claim of outperforming CoT and transferring expertise. However, this is undermined by the potential for confounded reward signals when masking correlated anatomical regions simultaneously, which may not isolate individual memory effects and could lead to alignments based on dataset correlations rather than true clinical intuition.
  2. Empirical evaluations section: The abstract states comprehensive empirical evaluations demonstrating significant outperformance, but no specific quantitative results, tables, error bars, ablation studies, or comparisons to baselines are detailed, making it impossible to evaluate the soundness of the central empirical claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We have carefully reviewed the concerns regarding the CCR mechanism and the presentation of empirical results. Below we provide point-by-point responses and describe the revisions we will implement to address these issues.

read point-by-point responses
  1. Referee: [—] Abstract and CCR mechanism description: The assertion that CCR leverages RL with region-level feature masking to accurately quantify causal contributions of memories is central to the claim of outperforming CoT and transferring expertise. However, this is undermined by the potential for confounded reward signals when masking correlated anatomical regions simultaneously, which may not isolate individual memory effects and could lead to alignments based on dataset correlations rather than true clinical intuition.

    Authors: We appreciate the referee's identification of a potential limitation in the CCR design. The region-level masking is performed to generate counterfactual rewards by comparing diagnostic outcomes with and without specific anatomical features, with the RL objective including sparsity regularization to reduce redundancy. However, we acknowledge that simultaneous masking of correlated regions may not fully isolate individual causal effects and could reflect dataset-specific correlations. In the revised manuscript, we will expand the CCR section with a clearer description of the masking protocol (including any sequential application), add an explicit discussion of the assumptions and potential confounding factors, and include new experiments comparing against alternative causal estimation techniques to better substantiate the alignment with clinical intuition. revision: yes

  2. Referee: [—] Empirical evaluations section: The abstract states comprehensive empirical evaluations demonstrating significant outperformance, but no specific quantitative results, tables, error bars, ablation studies, or comparisons to baselines are detailed, making it impossible to evaluate the soundness of the central empirical claim.

    Authors: We agree that the current presentation does not make the empirical claims sufficiently verifiable from the abstract alone. The full manuscript contains Section 4 with quantitative results across multiple datasets, including accuracy and F1-score comparisons to CoT and other baselines, ablation studies isolating each proposed component, and error bars derived from repeated runs. To resolve this, we will revise the abstract to include key numerical highlights, add a summary table in the introduction, and ensure all experimental details, tables, and statistical analyses are explicitly cross-referenced and described in the main text for easier evaluation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a novel framework with three main components—Meta Query for Prior Memorization, Causal Counterfactual Refinement (CCR) via RL on region-masked counterfactual rewards, and Intrinsic Memory Transition (IMT) using full-vocabulary divergence alignment—followed by empirical claims of outperforming SOTA methods including CoT on diagnostic accuracy across datasets. No equations or derivation steps are shown that reduce a claimed prediction or result to its own fitted inputs, self-definitions, or unverified self-citations by construction. The performance assertions rest on external evaluations rather than internal redefinitions or load-bearing self-references. The derivation chain remains self-contained as a proposed architecture with independent mechanisms.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 3 invented entities

The framework rests on the assumption that discrete tokenization creates specific cognitive misalignments and that the three new mechanisms can evolve clinically faithful memories. No external benchmarks or independent evidence for the invented components are supplied.

free parameters (1)
  • learnable probes
    Introduced to retrieve structured priors from the anatomical prior encoder for memory generation.
axioms (1)
  • domain assumption Discrete tokenization in medical VLMs causes quantization loss, long-range information dissipation, and missing case-adaptive expertise.
    Presented as the fundamental cognitive misalignment the framework addresses.
invented entities (3)
  • Meta Query for Prior Memorization mechanism no independent evidence
    purpose: Generate condensed implicit memories from anatomical priors using learnable probes.
    Core starting component of the latent memory evolution process.
  • Causal Counterfactual Refinement (CCR) no independent evidence
    purpose: Quantify causal contribution of memories via RL and region masking to prune redundancies.
    Ensures clinical fidelity in the evolutionary process.
  • Intrinsic Memory Transition (IMT) no independent evidence
    purpose: Internalize teacher diagnostic patterns into student branch via vocabulary alignment.
    Final step that transfers expertise into endogenous parameters.

pith-pipeline@v0.9.0 · 5535 in / 1460 out tokens · 106946 ms · 2026-05-07T13:55:37.205719+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 38 canonical work pages · 8 internal anchors

  1. [1]

    arXiv preprint arXiv:2407.15621

    Arasteh, S.T., Lotfinia, M., Bressem, K., Siepmann, R., Adams, L., Ferber, D., Kuhl, C., Kather, J.N., Nebelung, S., Truhn, D.: Radiorag: factual large language models for enhanced diagnostics in radiology using online retrieval augmented generation. arXiv preprint arXiv:2407.15621 (2024)

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    arXiv preprint arXiv:2512.16201 (2025)

    Bose, S., Rajendran, R.K., Debnath, B., Karydis, K., Roy-Chowdhury, A.K., Chakradhar, S.: Visual alignment of medical vision-language models for grounded radiology report generation. arXiv preprint arXiv:2512.16201 (2025)

  4. [4]

    Cognitive Research: Principles and Implications 4(1), 7 (2019)

    Brunyé, T.T., Drew, T., Weaver, D.L., Elmore, J.G.: A review of eye tracking for understanding and improving diagnostic interpretation. Cognitive Research: Principles and Implications 4(1), 7 (2019)

  5. [5]

    arXiv preprint arXiv:2510.12603 (2025)

    Chen, C., Ma, Z., Li, Y., Hu, Y., Wei, Y., Li, W., Nie, L.: Reasoning in the dark: Interleaved vision-text reasoning in latent space. arXiv preprint arXiv:2510.12603 (2025)

  6. [6]

    Towards injecting medical visual knowledge into multimodal llms at scale

    Chen, J., Gui, C., Ouyang, R., Gao, A., Chen, S., Chen, G.H., Wang, X., Cai, Z., Ji, K., Wan, X., et al.: Towards injecting medical visual knowledge into multimodal llms at scale. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 7346–7370 (2024)

  7. [7]

    arXiv preprint arXiv:2510.10052 (2025)

    Chen, K., Rui, S., Jiang, Y., Wu, J., Zheng, Q., Song, C., Wang, X., Zhou, M., Liu, M.: Think twice to see more: Iterative visual reasoning in medical vlms. arXiv preprint arXiv:2510.10052 (2025)

  8. [8]

    CoRR abs/2308.16184 (2023)

    Cheng, J., Ye, J., Deng, Z., Chen, J., Li, T., Wang, H., Su, Y., Huang, Z., Chen, J., Jiang, L., et al.: Sam-med2d. arXiv preprint arXiv:2308.16184 (2023)

  9. [9]

    arXiv preprint arXiv:2506.08356 (2025)

    Chopra, S., Sanchez-Rodriguez, G., Mao, L., Feola, A.J., Li, J., Kira, Z.: Medmoe: modality-specialized mixture of experts for medical vision-language understanding. arXiv preprint arXiv:2506.08356 (2025)

  10. [10]

    From explicit cot to implicit cot: Learning to internalize cot step by step

    Deng, Y., Choi, Y., Shieber, S.: From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838 (2024)

  11. [11]

    arXiv preprint arXiv:2601.10949 (2026)

    Ding, M., Zhang, J., Wang, W., Zhong, H., Luo, X., Chen, W., Shen, L.: Mmedexpert-r1: Strengthening multimodal medical reasoning via domain-specific adaptation and clinical guideline reinforcement. arXiv preprint arXiv:2601.10949 (2026)

  12. [12]

    Gai, X., Zhou, C., Liu, J., Feng, Y., Wu, J., Liu, Z.: Medthink: Explaining medical visual question answering via multimodal decision-making rationale. arXiv preprint arXiv:2404.12372 (2024)

  13. [13]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B.R., Kailkhura, B., Bhatele, A., Goldstein, T.: Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171 (2025)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Gu, T., Yang, K., Liu, D., Cai, W.: Lapa: Latent prompt assist model for medical visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4971–4980 (2024)

  15. [15]

    Training Large Language Models to Reason in a Continuous Latent Space

    Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., Tian, Y.: Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769 (2024)

  16. [16]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)

  17. [17]

    ICLR (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR (2022)

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Hu, Y., Li, T., Lu, Q., Shao, W., He, J., Qiao, Y., Luo, P.: Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22170–22183 (2024)

  19. [19]

    IEEE Transactions on Medical Imaging (2026)

    Lai, Y., Zhong, J., Li, M., Zhao, S., Li, Y., Psounis, K., Yang, X.: Med-r1: Rein- forcement learning for generalizable medical reasoning in vision-language models. IEEE Transactions on Medical Imaging (2026)

  20. [20]

    Scientific data 5(1), 1–10 (2018)

    Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5(1), 1–10 (2018)

  21. [21]

    arXiv preprint arXiv:2510.22728 (2025)

    Le-Duc, K., Nguyen, D.M., Trinh, P.T., Nguyen, T.P., Diep, N.T., Ngo, A., Vu, T., Vuong, T., Nguyen, A.T., Nguyen, M., et al.: S-chain: Structured visual chain-of-thought for medicine. arXiv preprint arXiv:2510.22728 (2025)

  22. [22]

    Advances in neural information processing systems 33, 9459–9474 (2020)

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, 9459–9474 (2020)

  23. [23]

    In: Findings of the Association for Computational Linguistics: EMNLP 2024

    Li, B., Yan, T., Pan, Y., Luo, J., Ji, R., Ding, J., Xu, Z., Liu, S., Dong, H., Lin, Z., et al.: Mmedagent: Learning to use medical tools with multi-modal agent. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 8745–8760 (2024)

  24. [24]

    Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

    Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

  25. [25]

    arXiv preprint arXiv:2505.13308 (2025)

    Li, H., Li, C., Wu, T., Zhu, X., Wang, Y., Yu, Z., Jiang, E.H., Zhu, S.C., Jia, Z., Wu, Y.N., et al.: Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space. arXiv preprint arXiv:2505.13308 (2025)

  26. [26]

    arXiv preprint arXiv:2411.14522 (2024)

    Li, T., Su, Y., Li, W., Fu, B., Chen, Z., Huang, Z., Wang, G., Ma, C., Chen, Y., Hu, M., et al.: Gmai-vl & gmai-vl-5.5 m: A large vision-language model and a comprehensive multimodal dataset towards general medical ai. arXiv preprint arXiv:2411.14522 (2024)

  27. [27]

    IEEE Transactions on Information Theory 37(1), 145–151 (1991)

    Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1), 145–151 (1991)

  28. [28]

    arXiv preprint arXiv:2511.19046 (2025)

    Liu, A., Xue, R., Cao, X.R., Shen, Y., Lu, Y., Li, X., Chen, Q., Chen, J.: Medsam3: Delving into segment anything with medical concepts. arXiv preprint arXiv:2511.19046 (2025)

  29. [29]

    Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)

  30. [30]

    arXiv preprint arXiv:2412.17747 (2024)

    Liu, L., Pfeiffer, J., Wu, J., Xie, J., Szlam, A.: Deliberation in latent space via differentiable cache augmentation. arXiv preprint arXiv:2412.17747 (2024)

  31. [31]

    In: Machine learning for health (ML4H)

    Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-flamingo: a multimodal medical few-shot learner. In: Machine learning for health (ML4H). pp. 353–367. PMLR (2023)

  32. [32]

    arXiv preprint arXiv:2602.23363 (2026)

    Mullappilly, S.S., Kurpath, M.I., Mohamed, O., Zidan, M., Khan, F., Khan, S., Anwer, R., Cholakkal, H.: Medix-r1: Open ended medical reinforcement learning. arXiv preprint arXiv:2602.23363 (2026)

  33. [33]

    arXiv preprint arXiv:2412.07769 (2024)

    Mullappilly, S.S., Kurpath, M.I., Pieri, S., Alseiari, S.Y., Cholakkal, S., Aldahmani, K., Khan, F., Anwer, R., Khan, S., Baldwin, T., et al.: Bimedix2: Bio-medical expert lmm for diverse medical modalities. arXiv preprint arXiv:2412.07769 (2024)

  34. [34]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Nath, V., Li, W., Yang, D., Myronenko, A., Zheng, M., Lu, Y., Liu, Z., Yin, H., Law, Y.M., Tang, Y., et al.: Vila-m3: Enhancing vision-language models with medical expert knowledge. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14788–14798 (2025)

  35. [35]

    Advances in Health Sciences Education 14(Suppl 1), 37–49 (2009)

    Norman, G.: Dual processing and diagnostic errors. Advances in Health Sciences Education 14(Suppl 1), 37–49 (2009)

  36. [36]

    Advances in neural information processing systems 35, 27730–27744 (2022)

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744 (2022)

  37. [37]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Pan, J., Liu, C., Wu, J., Liu, F., Zhu, J., Li, H.B., Chen, C., Ouyang, C., Rueckert, D.: Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 337–347. Springer (2025)

  38. [38]

    Multimodal chain of continuous thought for latent-space reasoning in vision-language models

    Pham, T.H., Ngo, C.: Multimodal chain of continuous thought for latent-space reasoning in vision-language models. arXiv preprint arXiv:2508.12587 (2025)

  39. [39]

    Advances in neural information processing systems 36, 53728–53741 (2023)

    Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36, 53728–53741 (2023)

  40. [40]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  41. [41]

    MedGemma Technical Report

    Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)

  42. [42]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  43. [43]

    In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

    Shen, Z., Yan, H., Zhang, L., Hu, Z., Du, Y., He, Y.: Codi: Compressing chain-of-thought into continuous space via self-distillation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 677–693 (2025)

  44. [44]

    arXiv preprint arXiv:2504.01886 (2025)

    Su, Y., Li, T., Liu, J., Ma, C., Ning, J., Tang, C., Ju, S., Ye, J., Chen, P., Hu, M., et al.: Gmai-vl-r1: Harnessing reinforcement learning for multimodal medical reasoning. arXiv preprint arXiv:2504.01886 (2025)

  45. Sun, H., Jiang, Y., Lou, W., Zhang, Y., Li, W., Wang, L., Liu, M., Liu, L., Wang, X.: Chiron-o1: Igniting multimodal large language models towards generalizable medical reasoning via mentor-intern collaborative search. arXiv preprint arXiv:2506.16962 (2025)

  46. Tan, W., Li, J., Ju, J., Luo, Z., Song, R., Luan, J.: Think silently, think fast: Dynamic latent compression of LLM reasoning chains. arXiv preprint arXiv:2505.16552 (2025)

  47. Van Sonsbeek, T., Derakhshani, M.M., Najdenkoska, I., Snoek, C.G., Worring, M.: Open-ended medical visual question answering through prefix tuning of language models. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 726–736. Springer (2023)

  48. Waite, S., Scott, J., Gale, B., Fuchs, T., Kolla, S., Reede, D.: Interpretive error in radiology. American Journal of Roentgenology 208(4), 739–749 (2017)

  49. Wang, Y., Liu, J., Gao, S., Feng, B., Tang, Z., Gai, X., Wu, J., Liu, Z.: V2T-CoT: From vision to text chain-of-thought for medical reasoning and diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 658–668. Springer (2025)

  50. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)

  51. Wu, C., Zhang, X., Zhang, Y., Hui, H., Wang, Y., Xie, W.: Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nature Communications 16(1), 7866 (2025)

  52. Wu, J., Deng, W., Li, X., Liu, S., Mi, T., Peng, Y., Xu, Z., Liu, Y., Cho, H., Choi, C.I., et al.: MedReason: Eliciting factual medical reasoning steps in LLMs via knowledge graphs. arXiv preprint arXiv:2504.00993 (2025)

  53. Wu, J., Zhu, J., Qi, Y., Chen, J., Xu, M., Menolascina, F., Grau, V.: Medical graph RAG: Towards safe medical large language model via graph retrieval-augmented generation. arXiv preprint arXiv:2408.04187 (2024)

  54. Xu, Y., Guo, X., Zeng, Z., Miao, C.: SoftCoT: Soft chain-of-thought for efficient reasoning with LLMs. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 23336–23351 (2025)

  55. Xu, Y., Guo, X., Zeng, Z., Miao, C.: SoftCoT++: Test-time scaling with soft chain-of-thought reasoning. arXiv preprint arXiv:2505.11484 (2025)

  56. Xu, Z., Wang, H., Bespalov, D., Wu, X., Stone, P., Qi, Y.: LaRS: Latent reasoning skills for chain-of-thought reasoning. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 3624–3643 (2024)

  57. Ye, J., Wang, G., Li, Y., Deng, Z., Li, W., Li, T., Duan, H., Huang, Z., Su, Y., Wang, B., et al.: GMAI-MMBench: A comprehensive multimodal evaluation benchmark towards general medical AI. Advances in Neural Information Processing Systems 37, 94327–94427 (2024)

  58. Yu, H., Cheng, T., Cheng, Y., Feng, R.: FineMedLM-o1: Enhancing the medical reasoning ability of LLM from supervised fine-tuning to test-time training. arXiv e-prints, arXiv–2501 (2025)

  59. Yu, X., Xu, C., Zhang, G., Chen, Z., Zhang, Y., He, Y., Jiang, P.T., Zhang, J., Hu, X., Yan, S.: ViSMem: Latent vision memory unlocks potential of vision-language models. arXiv preprint arXiv:2511.11007 (2025)

  60. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556–9567 (2024)

  61. Zakka, C., Shad, R., Chaurasia, A., Dalal, A.R., Kim, J.L., Moor, M., Fong, R., Phillips, C., Alexander, K., Ashley, E., et al.: Almanac: Retrieval-augmented language models for clinical medicine. NEJM AI 1(2), AIoa2300068 (2024)

  62. Zhang, G., Fu, M., Yan, S.: MemGen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704 (2025)

  63. Zhang, W., Guo, J., Zhang, H., Zhang, P., Chen, J., Zhang, S., Zhang, Z., Yi, Y., Bu, H.: Patho-AgenticRAG: Towards multimodal agentic retrieval-augmented generation for pathology VLMs via reinforcement learning. arXiv preprint arXiv:2508.02258 (2025)

  64. Zhang, X., Wu, C., Zhao, Z., Lin, W., Zhang, Y., Wang, Y., Xie, W.: PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 (2023)

  65. Zhang, Y., Xu, W., Zhao, X., Wang, W., Feng, F., He, X., Chua, T.S.: Reinforced latent reasoning for LLM-based recommendation. arXiv preprint arXiv:2505.19092 (2025)

  66. Zhao, X., Liu, S., Yang, S.Y., Miao, C.: MedRAG: Enhancing retrieval-augmented generation with knowledge graph-elicited reasoning for healthcare copilot. In: Proceedings of the ACM on Web Conference 2025. pp. 4442–4457 (2025)

  67. Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

  68. Zhu, K., Xia, P., Li, Y., Zhu, H., Wang, S., Yao, H.: MMedPO: Aligning medical vision-language models with clinical-aware multimodal preference optimization. arXiv preprint arXiv:2412.06141 (2024)

  69. Zuo, Y., Qu, S., Li, Y., Chen, Z., Zhu, X., Hua, E., Zhang, K., Ding, N., Zhou, B.: MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362 (2025)