pith. machine review for the scientific record.

arxiv: 2605.04064 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.CV

Recognition: unknown

Improving Medical VQA through Trajectory-Aware Process Supervision

Halil Ibrahim Gulluk, Olivier Gevaert

Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV
keywords medical VQA · process supervision · reasoning trajectories · dynamic time warping · vision-language models · policy optimization · reinforcement learning

The pith

A reward based on the similarity of reasoning trajectories improves medical visual question answering models beyond exact answer matching alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that medical VQA models gain from explicit supervision on the steps of their reasoning rather than only on the final answer. It generates reasoning trajectories for six existing benchmarks, then trains models first with supervised fine-tuning and then with a policy-optimization step that adds a reward for matching the sequence of those steps. Similarity is measured by embedding each reasoning step with a sentence transformer and computing the dynamic time warping distance between the resulting embedding sequences. If this approach holds, models produce both more accurate answers and higher-quality explanations on medical image questions. This matters because medical decisions require traceable reasoning to catch errors before they affect patients.

Core claim

The authors generate reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm and an LLM judge, then train vision-language models with supervised fine-tuning followed by Group Relative Policy Optimization. The novel element is a process reward that combines exact-match scoring on the final answer with the dynamic time warping distance between sentence-transformer embeddings of the generated reasoning steps and the ground-truth trajectories. This combined reward produces consistent gains over supervised fine-tuning alone across all six benchmarks.
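
The abstract specifies the reward's ingredients but not its exact form. A minimal sketch of one plausible reading, assuming the unspecified weighting is a scalar hyperparameter alpha, that the per-step cost is cosine distance, and that the DTW distance is squashed to a bounded similarity via exp(-d); the embedder name is a placeholder, since the paper's choice of sentence transformer is not stated here:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Placeholder embedder; the paper's sentence transformer is not named in the abstract.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """Classic dynamic time warping over two sequences of step embeddings."""
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        cost = 1.0 - a @ b.T  # pairwise cosine distance, shape (len(a), len(b))
        n, m = cost.shape
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return float(D[n, m])

    def combined_reward(gen_steps, ref_steps, gen_answer, ref_answer, alpha=0.5):
        """Exact match on the final answer plus a DTW-based process term.
        alpha and the exp(-d) squashing are hypothetical choices, not from the paper."""
        em = float(gen_answer.strip().lower() == ref_answer.strip().lower())
        d = dtw_distance(embedder.encode(gen_steps), embedder.encode(ref_steps))
        return alpha * em + (1.0 - alpha) * np.exp(-d)

Under GRPO, a scalar like this would score each sampled completion in a group, with advantages computed relative to the group mean; nothing in the sketch depends on that detail.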

What carries the argument

The DTW-based process reward that compares sequences of sentence-transformer embeddings of generated and ground-truth reasoning trajectories.
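
The dtw_distance in the sketch above implements the standard recurrence. Writing the generated trajectory as embedded steps x_1, ..., x_n and the reference as y_1, ..., y_m, with per-step cost c(x_i, y_j) (cosine distance is one plausible choice; the paper's exact cost is not specified here):

    D(i, j) = c(x_i, y_j) + \min\{\, D(i-1, j),\; D(i, j-1),\; D(i-1, j-1) \,\},
    \qquad D(0, 0) = 0, \quad D(i, 0) = D(0, j) = \infty \ \text{for}\ i, j \ge 1,

with the trajectory distance D(n, m), the cost of the cheapest monotone alignment of the two step sequences. Unlike a rigid step-by-step pairing, DTW tolerates trajectories that decompose the same reasoning into different numbers of steps.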

If this is right

  • Medical VQA models reach higher mean accuracy when process rewards guide their reasoning paths.
  • Semantic and sequence-based explanation metrics improve alongside answer correctness.
  • Process supervision adds value on top of standard outcome-only rewards in vision-language model training.
  • Generated reasoning trajectory datasets become reusable resources for further medical VQA work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory comparison method could be tested on non-medical VQA or other step-by-step reasoning domains where trajectory data can be generated.
  • Swapping the sentence transformer for domain-specific medical embeddings might strengthen or weaken the DTW signal.
  • The gains rest on the initial quality of the generated trajectories, so advances in trajectory generation would directly affect results.
  • Scaling the approach to larger base models could reveal whether the process reward benefit grows with model capacity.

Load-bearing premise

That dynamic time warping distance between embedded reasoning steps supplies a training signal that actually improves the quality of medical reasoning.

What would settle it

Training the same models with only the exact-match reward and observing that accuracy, BERTScore, and ROUGE-L no longer improve over the supervised-fine-tuning baseline on the six benchmarks would falsify the value of the trajectory component.
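
In implementation terms, that ablation is a one-flag change to the reward used during GRPO, holding the SFT checkpoint and all other hyperparameters fixed. A hedged sketch reusing the embedder and dtw_distance defined earlier (the equal weighting is hypothetical):

    def reward_fn(gen_steps, ref_steps, gen_answer, ref_answer, use_dtw: bool):
        em = float(gen_answer.strip().lower() == ref_answer.strip().lower())
        if not use_dtw:
            return em  # the missing control: outcome-only exact-match reward
        d = dtw_distance(embedder.encode(gen_steps), embedder.encode(ref_steps))
        return 0.5 * em + 0.5 * np.exp(-d)  # equal weighting is a hypothetical choice

    # The decisive comparison: run GRPO twice from the identical SFT checkpoint,
    # once with use_dtw=False and once with use_dtw=True. If the two runs match
    # on accuracy, BERTScore, and ROUGE-L, the trajectory term carries no signal.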

Figures

Figures reproduced from arXiv: 2605.04064 by Halil Ibrahim Gulluk, Olivier Gevaert.

Figure 1: Overview of the COMCTS reasoning dataset generation process.
Figure 2: Example of a reasoning trajectory from the generated dataset.
Figure 3: Dynamic Time Warping (DTW) aligns two reasoning trajectories.
Figure 4: COMCTS-generated sample from VQA-RAD. Question: What are the hyperdense lesions noted at the edges of the aorta? GT answer (in D_VQA): calcified atherosclerosis. Reasoning answer generated by COMCTS (in D_reas): "The image is a cross-sectional CT scan of the abdomen. It shows the aorta, kidneys, and surrounding structures. There are hyperdense lesions noted at the edges of the aorta. The CT scan image shows…"
Figure 5: COMCTS-generated sample from VQA-RAD. Question: Do you see cardiomegaly? GT answer (in D_VQA): no. Reasoning answer generated by COMCTS (in D_reas): "The image is a chest X-ray showing the thoracic cavity with the heart, lungs, and ribs visible. The heart appears to be of normal size, and there are no obvious signs of cardiomegaly. Cardiomegaly is the medical term for an enlarged heart. In chest X-rays, car…"
Figure 6: COMCTS-generated sample from Slake-VQA. Question: Does the picture contain heart? GT answer (in D_VQA): No. Reasoning answer generated by COMCTS (in D_reas): "The image is a grayscale axial CT scan of the abdomen. It shows the liver on the left, the stomach and intestines in the center, and the spine in the lower center. The kidneys are visible on either side of the spine. There is fat and muscle tissue sur…"
Figure 7: COMCTS-generated sample from Slake-VQA. Question: Does the picture contain kidney? GT answer (in D_VQA): No. Reasoning answer generated by COMCTS (in D_reas): "The image is a grayscale axial CT scan of the upper abdomen. It shows the liver, gallbladder, portions of the stomach and duodenum, and the spine. The liver takes up a large portion of the left side of the image, appearing as a relatively homogenous …"
Figure 8: COMCTS-generated sample from PMC-VQA. Question: What is the condition shown in the CT scan? Choices: A) Pancreatic cancer, B) Right upper lobe herniation, C) Broken rib, D) Appendicitis. GT answer (in D_VQA): Right upper lobe herniation. Reasoning answer generated by COMCTS (in D_reas): "The image is a CT scan of the chest. The scan shows the lungs and surrounding structures. There is a noticeable protrusio…"
Figure 9: COMCTS-generated sample from PMC-VQA. Question: Which bone is affected according to the radiograph? Choices: A) Femur, B) Humerus, C) Tibia, D) Fibula. GT answer (in D_VQA): Humerus. Reasoning answer generated by COMCTS (in D_reas): "The image is a black and white radiograph (X-ray) of a human upper limb. The image shows a clear view of the shoulder joint and part of the upper arm. There appears to be a fra…"
Figure 10: COMCTS-generated sample from PMC-VQA. Question: What is the location of the mass observed in the CT scan? Choices: A) Hepatic lobe, B) Pulmonary artery, C) Renal artery, D) Renal. GT answer (in D_VQA): Renal. Reasoning answer generated by COMCTS (in D_reas): "The image is a grayscale axial CT scan of the abdomen. It displays the liver, stomach, spleen, kidneys, and surrounding structures. A large, well…"
Figure 11: COMCTS-generated sample from PMC-VQA. Question: Is liver present? GT answer (in D_VQA): Yes. Reasoning answer generated by COMCTS (in D_reas): "The image is a microscopic view of tissue stained with hematoxylin and eosin (H&E). Numerous dark purple, small, round structures (likely nuclei) are visible scattered throughout a pinkish-purple background. There are several cells with visible nuclei, and a larger…"
Figure 12: Needleman-Wunsch (NW) alignment for reasoning trajectories.
Original abstract

Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. We make our code and generated reasoning datasets publicly available at https://anonymous.4open.science/r/MICCAI-R1-MED-VQA-code-B14B/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that generating reasoning trajectories for six medical VQA benchmarks via COMCTS (with LLM verification) enables a two-stage pipeline of SFT followed by GRPO using a combined reward (exact-match on final answers plus DTW distance on sentence-transformer embeddings of reasoning steps). This yields consistent gains over SFT-only baselines: mean accuracy rises from 0.598 to 0.689, BERTScore from 0.845 to 0.881, and ROUGE-L from 0.665 to 0.748. The authors release the generated datasets and code.

Significance. If the central result holds after addressing controls, the work provides evidence that trajectory-aware process rewards can enhance reasoning quality in medical VLMs beyond outcome-only supervision, with the public datasets offering a reusable resource for the community.

major comments (2)
  1. [Experiments] Experiments section: The reported comparisons are limited to SFT-only versus SFT+GRPO with the combined DTW+exact-match reward. No GRPO run using exact-match reward alone (on the identical SFT checkpoint) is presented, so the observed gains cannot be attributed specifically to the DTW trajectory signal rather than to GRPO optimization, training duration, or reward scaling.
  2. [§3] §3 (Trajectory Generation): The ground-truth trajectories are produced by the same class of open-source VLMs used in training and verified only by an LLM judge, with no reported human validation, inter-annotator agreement, or quality metrics. This leaves open whether the DTW signal reflects genuine reasoning quality or artifacts of the generation process.
minor comments (3)
  1. [Abstract] Abstract and §4: Mean improvements are stated without the number of random seeds, standard deviations, or statistical significance tests, making it difficult to assess robustness across the six benchmarks.
  2. [Method] Method: The relative weighting between the DTW and exact-match terms is treated as a free hyperparameter, yet no ablation or sensitivity analysis on this weighting is provided.
  3. [Related Work] Related Work: The positioning relative to prior process-supervision methods (e.g., in math or general VQA) could be expanded with more direct citations to recent GRPO or trajectory-reward papers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point-by-point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The reported comparisons are limited to SFT-only versus SFT+GRPO with the combined DTW+exact-match reward. No GRPO run using exact-match reward alone (on the identical SFT checkpoint) is presented, so the observed gains cannot be attributed specifically to the DTW trajectory signal rather than to GRPO optimization, training duration, or reward scaling.

    Authors: We agree that this ablation is necessary to isolate the contribution of the DTW-based process reward. In the revised manuscript, we will add results from an additional GRPO training run that uses only the exact-match reward on the identical SFT checkpoint. This will enable a direct comparison and clarify whether the observed gains stem from the trajectory signal or from other factors such as the GRPO optimization itself. revision: yes

  2. Referee: [§3] §3 (Trajectory Generation): The ground-truth trajectories are produced by the same class of open-source VLMs used in training and verified only by an LLM judge, with no reported human validation, inter-annotator agreement, or quality metrics. This leaves open whether the DTW signal reflects genuine reasoning quality or artifacts of the generation process.

    Authors: We acknowledge this limitation. Section 3 describes the use of COMCTS with open-source VLMs followed by LLM verification to scale trajectory generation across six benchmarks. We did not perform human validation or report inter-annotator agreement. In the revision, we will expand §3 with a dedicated discussion of the verification methodology, explicitly note the lack of human evaluation as a limitation, and report additional automatic quality metrics on the released trajectories (such as length statistics, embedding-based consistency, and diversity). The public release of the datasets will also allow the community to conduct further validation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical reward definition and results do not reduce to inputs by construction

Full rationale

The paper's chain consists of (1) independent generation of reasoning trajectories via COMCTS + LLM judge on external benchmarks, (2) definition of a DTW process reward using a separate sentence-transformer embedding model, and (3) empirical comparison of combined reward vs. SFT-only. None of these steps is self-definitional, a fitted input renamed as prediction, or dependent on a load-bearing self-citation. The DTW distance is computed from externally generated references and an off-the-shelf embedder; the reported accuracy/BERTScore/ROUGE gains are experimental outcomes, not quantities forced by the training objective itself. Missing ablations are a methodological limitation but do not create circularity under the specified criteria.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claim rests on the assumption that generated trajectories are high-quality and that DTW similarity is a valid proxy for reasoning improvement; no new physical entities are postulated.

free parameters (2)
  • reward weighting between DTW and exact-match
    The abstract describes combining the two rewards but does not specify how they are balanced, implying a tunable hyperparameter.
  • GRPO and SFT training hyperparameters
    Standard RL and fine-tuning hyperparameters are required but not detailed in the abstract.
axioms (3)
  • domain assumption: COMCTS with open-source VLMs can produce useful reasoning trajectories for medical VQA questions
    The data generation step depends on this capability.
  • domain assumption: An LLM can reliably judge the quality of generated reasoning trajectories
    Used as the verification judge in the COMCTS pipeline.
  • domain assumption: Sentence-transformer embeddings plus DTW distance capture semantically meaningful similarity between reasoning processes
    This underpins the novel process reward.
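
The third assumption is testable without any training run. A minimal sanity check, reusing the embedder and dtw_distance sketched earlier, with trajectories adapted from the paper's Figure 4 and Figure 5 examples: a paraphrase of a reference trajectory should sit closer under DTW than an unrelated one.

    reference = [
        "The image is an axial CT scan of the abdomen.",
        "Hyperdense lesions are visible at the edges of the aorta.",
        "These findings are consistent with calcified atherosclerosis.",
    ]
    paraphrase = [
        "This is a cross-sectional abdominal CT image.",
        "Bright lesions appear along the aortic wall.",
        "The appearance suggests calcified atherosclerotic plaque.",
    ]
    unrelated = [
        "The chest X-ray shows the heart and lungs.",
        "The cardiac silhouette is within normal limits.",
        "There is no evidence of cardiomegaly.",
    ]
    d_para = dtw_distance(embedder.encode(reference), embedder.encode(paraphrase))
    d_unrel = dtw_distance(embedder.encode(reference), embedder.encode(unrelated))
    assert d_para < d_unrel  # if this fails, DTW cannot separate reasoning content

If such checks fail for a given embedder, the process reward would be optimizing noise regardless of the rest of the pipeline.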

pith-pipeline@v0.9.0 · 5547 in / 1564 out tokens · 55724 ms · 2026-05-10T17:13:59.688344+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

Reference graph

Works this paper leans on

36 extracted references · 18 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1] Ahn, J., Verma, R., Lou, R., Liu, D., Zhang, R., Yin, W.: Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157 (2024)
  2. [2] Ben Abacha, A., Hasan, S.A., Datla, V.V., Demner-Fushman, D., Müller, H.: VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, 9-12 September 2019 (2019)
  4. [4] Chen, J., Cai, Z., Ji, K., Wang, X., Liu, W., Wang, R., Hou, J., Wang, B.: HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925 (2024)
  5. [5] Chen, X., Lai, Z., Ruan, K., Chen, S., Liu, J., Liu, Z.: R-LLaVA: Improving Med-VQA understanding through visual region of interest. arXiv preprint arXiv:2410.20327 (2024)
  6. [6] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  7. [7] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
  8. [8] Hu, Y., Li, T., Lu, Q., Shao, W., He, J., Qiao, Y., Luo, P.: OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22170–22183 (2024)
  9. [9] Imani, S., Du, L., Shrivastava, H.: MathPrompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398 (2023)
  10. [10] Kwon, T., Ong, K.T.i., Kang, D., Moon, S., Lee, J.R., Hwang, D., Sohn, B., Sim, Y., Lee, D., Yeo, J.: Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 18417–18425 (2024)
  11. [11] Lample, G., Lacroix, T., Lachaux, M.A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., Martinet, X.: HyperTree proof search for neural theorem proving. Advances in Neural Information Processing Systems 35, 26337–26349 (2022)
  12. [12] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 1–10 (2018)
  13. [13] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)
  14. [14] Liévin, V., Hother, C.E., Motzfeldt, A.G., Winther, O.: Can large language models reason about medical questions? Patterns 5(3) (2024)
  15. [15] Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: PMC-CLIP: Contrastive language-image pre-training using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 525–536. Springer (2023)
  16. [16] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1650–1654. IEEE (2021)
  17. [17] Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung-Levy, S.: µ-Bench: A vision-language benchmark for microscopy understanding. arXiv preprint arXiv:2407.01791 (2024)
  18. [18] Ma, Q., Zhou, H., Liu, T., Yuan, J., Liu, P., You, Y., Yang, H.: Let's reward step by step: Step-level reward model as the navigators for reasoning. arXiv preprint arXiv:2310.10080 (2023)
  19. [19] Müller, M.: Dynamic time warping. In: Information Retrieval for Music and Motion, pp. 69–84 (2007)
  20. [20] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  21. [21] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023)
  22. [22] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  23. [23] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  24. [24] Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)
  25. [25] Tu, T., Azizi, S., Driess, D., Schaekermann, M., Amin, M., Chang, P.C., Carroll, A., Lau, C., Tanno, R., Ktena, I., et al.: Towards generalist biomedical AI. NEJM AI 1(3), AIoa2300138 (2024)
  26. [26] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
  27. [27] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022)
  28. [28] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  29. [29] Wu, C., Lin, W., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: PMC-LLaMA: Towards building open-source language models for medicine (2023)
  30. [30] Wu, C., Lin, W., Zhang, X., Zhang, Y., Xie, W., Wang, Y.: PMC-LLaMA: Toward building open-source language models for medicine. Journal of the American Medical Informatics Association 31(9), 1833–1843 (2024)
  31. [31] Yang, Z., He, C., Shi, X., Li, L., Yin, Q., Deng, S., Jiang, D.: Beyond the first error: Process reward models for reflective mathematical reasoning. arXiv preprint arXiv:2505.14391 (2025)
  32. [32] Yao, H., Huang, J., Wu, W., Zhang, J., Wang, Y., Liu, S., Wang, Y., Song, Y., Feng, H., Shen, L., et al.: Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search. arXiv preprint arXiv:2412.18319 (2024)
  33. [33] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36, 11809–11822 (2023)
  34. [34] Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)
  35. [35] Zhang, X., Wu, C., Zhao, Z., Lin, W., Zhang, Y., Wang, Y., Xie, W.: PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 (2023)
  36. [36] Zhang, Z., Zheng, C., Wu, Y., Zhang, B., Lin, R., Yu, B., Liu, D., Zhou, J., Lin, J.: The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301 (2025)