arxiv: 2605.04064 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.CV

Improving Medical VQA through Trajectory-Aware Process Supervision

Halil Ibrahim Gulluk, Olivier Gevaert

Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords medical VQA · process supervision · reasoning trajectories · dynamic time warping · vision-language models · policy optimization · reinforcement learning

The pith

A reward based on the similarity of reasoning trajectories improves medical visual question answering models beyond exact answer matching alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that medical VQA models gain from explicit supervision on the steps of their reasoning rather than only on the final answer. It generates reasoning trajectories for six existing benchmarks, then trains models first with supervised fine-tuning and next with a policy optimization step that adds a reward for matching the sequence of those steps. The similarity is measured by embedding each reasoning step with a sentence transformer and computing dynamic time warping distance between the resulting sequences. If this approach holds, models produce both more accurate answers and higher-quality explanations on medical image questions. This would matter because medical decisions require traceable reasoning to catch errors before they affect patients.

Core claim

The authors generate reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm and an LLM judge, then train vision-language models with supervised fine-tuning followed by Group Relative Policy Optimization. The novel element is a process reward that combines exact-match scoring on the final answer with the dynamic time warping distance between sentence-transformer embeddings of the generated reasoning steps and the ground-truth trajectories. This combined reward produces consistent gains over supervised fine-tuning alone across all six benchmarks.
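
The combined reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight `lam`, the `1 / (1 + d)` mapping from DTW distance to a bounded similarity, and the case-insensitive answer normalization are all assumptions introduced here, since the summary does not say how the two terms are scaled or balanced.

```python
def exact_match(pred: str, gold: str) -> float:
    """Return 1.0 when the normalized answers agree, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def process_similarity(dtw_distance: float) -> float:
    """Map a non-negative DTW distance to (0, 1]; smaller distance gives a higher score.

    Assumption: the 1 / (1 + d) squashing is illustrative only.
    """
    if dtw_distance < 0:
        raise ValueError("DTW distance must be non-negative")
    return 1.0 / (1.0 + dtw_distance)


def combined_reward(pred: str, gold: str, dtw_distance: float, lam: float = 0.5) -> float:
    """Outcome reward plus a weighted process reward (lam is a hypothetical weight)."""
    return exact_match(pred, gold) + lam * process_similarity(dtw_distance)
```

Under these assumptions a correct answer with a perfectly matching trajectory scores `1 + lam`, while a wrong answer can still earn partial credit for reasoning that tracks the reference, which is the behavior an outcome-only reward cannot express.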

What carries the argument

The DTW-based process reward that compares sequences of sentence-transformer embeddings of generated and ground-truth reasoning trajectories.
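
A rough sketch of that comparison is given below, assuming cosine distance between step embeddings and the classic dynamic-programming form of DTW. The local distance, the absence of any length normalization, and the toy 2-D vectors (standing in for real sentence-transformer outputs) are assumptions made for illustration; the paper's exact choices are not stated in this summary.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(xs: List[Vector], ys: List[Vector]) -> float:
    """Classic O(len(xs) * len(ys)) dynamic-programming DTW over two step sequences."""
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i steps of xs with the first j of ys
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(xs[i - 1], ys[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy 2-D "embeddings" (hypothetical stand-ins for sentence-transformer vectors).
reference = [[1.0, 0.0], [0.0, 1.0]]
repeated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same steps, first one restated
reordered = [[0.0, 1.0], [1.0, 0.0]]             # same steps, wrong order

print(dtw_distance(reference, repeated))   # 0.0
print(dtw_distance(reference, reordered))  # 2.0
```

The toy case shows the property that makes DTW attractive here: a trajectory that restates a step aligns at zero cost, while one that visits the same steps in the wrong order is penalized.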

If this is right

Medical VQA models reach higher mean accuracy when process rewards guide their reasoning paths.
Semantic and sequence-based explanation metrics improve alongside answer correctness.
Process supervision adds value on top of standard outcome-only rewards in vision-language model training.
Generated reasoning trajectory datasets become reusable resources for further medical VQA work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same trajectory comparison method could be tested on non-medical VQA or other step-by-step reasoning domains where trajectory data can be generated.
Swapping the sentence transformer for domain-specific medical embeddings might strengthen or weaken the DTW signal.
The gains rest on the initial quality of the generated trajectories, so advances in trajectory generation would directly affect results.
Scaling the approach to larger base models could reveal whether the process reward benefit grows with model capacity.

Load-bearing premise

That dynamic time warping distance between embedded reasoning steps supplies a training signal that actually improves the quality of medical reasoning.

What would settle it

Training the same models from the same SFT checkpoint with GRPO and only the exact-match reward, and observing that accuracy, BERTScore, and ROUGE-L improve over the supervised-fine-tuning baseline as much as they do with the combined reward on the six benchmarks, would falsify the value of the trajectory component.

Figures

Figures reproduced from arXiv: 2605.04064 by Halil Ibrahim Gulluk, Olivier Gevaert.

**Figure 2.** Example of a reasoning trajectory from the generated dataset.

**Figure 3.** Dynamic Time Warping (DTW) aligns two reasoning trajectories repre…

**Figure 4.** COMCTS-generated sample from VQA-RAD. Question: What are the hyperdense lesions noted at the edges of the aorta? GT answer (in D_VQA): calcified atherosclerosis. Reasoning answer generated by COMCTS (in D_reas): The image is a cross-sectional CT scan of the abdomen. It shows the aorta, kidneys, and surrounding structures. There are hyperdense lesions noted at the edges of the aorta. The CT scan image shows…

**Figure 5.** COMCTS-generated sample from VQA-RAD. Question: Do you see cardiomegaly? GT answer (in D_VQA): no. Reasoning answer generated by COMCTS (in D_reas): The image is a chest X-ray showing the thoracic cavity with the heart, lungs, and ribs visible. The heart appears to be of normal size, and there are no obvious signs of cardiomegaly. Cardiomegaly is the medical term for an enlarged heart. In chest X-rays, car…

**Figure 6.** COMCTS-generated sample from Slake-VQA. Question: Does the picture contain heart? GT answer (in D_VQA): No. Reasoning answer generated by COMCTS (in D_reas): The image is a grayscale axial CT scan of the abdomen. It shows the liver on the left, the stomach and intestines in the center, and the spine in the lower center. The kidneys are visible on either side of the spine. There is fat and muscle tissue sur…

**Figure 7.** COMCTS-generated sample from Slake-VQA. Question: Does the picture contain kidney? GT answer (in D_VQA): No. Reasoning answer generated by COMCTS (in D_reas): The image is a grayscale axial CT scan of the upper abdomen. It shows the liver, gallbladder, portions of the stomach and duodenum, and the spine. The liver takes up a large portion of the left side of the image, appearing as a relatively homogenous…

**Figure 8.** COMCTS-generated sample from PMC-VQA. Question: What is the condition shown in the CT scan? Choices: A) Pancreatic cancer B) Right upper lobe herniation C) Broken rib D) Appendicitis. GT answer (in D_VQA): Right upper lobe herniation. Reasoning answer generated by COMCTS (in D_reas): The image is a CT scan of the chest. The scan shows the lungs and surrounding structures. There is a noticeable protrusio…

**Figure 9.** COMCTS-generated sample from PMC-VQA. Question: Which bone is affected according to the radiograph? Choices: A) Femur B) Humerus C) Tibia D) Fibula. GT answer (in D_VQA): Humerus. Reasoning answer generated by COMCTS (in D_reas): The image is a black and white radiograph (X-ray) of a human upper limb. The image shows a clear view of the shoulder joint and part of the upper arm. There appears to be a fra…

**Figure 10.** COMCTS-generated sample from PMC-VQA. Question: What is the location of the mass observed in the CT scan? Choices: A) Hepatic lobe B) Pulmonary artery C) Renal artery D) Renal. GT answer (in D_VQA): Renal. Reasoning answer generated by COMCTS (in D_reas): The image is a grayscale axial CT scan of the abdomen. It displays the liver, stomach, spleen, kidneys, and surrounding structures. A large, well…

**Figure 11.** COMCTS-generated sample from PMC-VQA. Question: Is liver present? GT answer (in D_VQA): Yes. Reasoning answer generated by COMCTS (in D_reas): The image is a microscopic view of tissue stained with hematoxylin and eosin (H&E). Numerous dark purple, small, round structures (likely nuclei) are visible scattered throughout a pinkish-purple background. There are several cells with visible nuclei, and a larger…

**Figure 12.** Needleman-Wunsch (NW) alignment for reasoning trajectories. Unlike…

Original abstract

Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting vector sequences. Experiments across six benchmarks demonstrate that combining the DTW-based process reward with exact-match reward consistently outperforms SFT-only training, raising mean accuracy from 0.598 to 0.689, mean BERTScore from 0.845 to 0.881, and mean ROUGE-L from 0.665 to 0.748. Our results highlight the importance of process supervision in training reasoning-capable medical VLMs. We make our code and generated reasoning datasets publicly available at https://anonymous.4open.science/r/MICCAI-R1-MED-VQA-code-B14B/
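
The reported lifts can be restated as absolute and relative gains. The short script below only re-derives arithmetic from the three pairs of means quoted in the abstract; the helper name `gains` is introduced here for illustration, and nothing in it speaks to variance or statistical significance, which the abstract does not report.

```python
from typing import Tuple

# SFT-only vs. SFT + GRPO (DTW + exact-match) means, as quoted in the abstract.
reported = {
    "accuracy": (0.598, 0.689),
    "BERTScore": (0.845, 0.881),
    "ROUGE-L": (0.665, 0.748),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before


for metric, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{metric}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

This gives absolute gains of 0.091, 0.036, and 0.083, or roughly 15.2%, 4.3%, and 12.5% relative to the SFT-only means for accuracy, BERTScore, and ROUGE-L respectively.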

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows gains from adding a DTW trajectory reward to GRPO after SFT on medical VQA benchmarks, but the experiments skip the ablation that would show whether that reward actually drives the improvement.

The paper's main result is that training medical VQA models with GRPO using both exact-match and a DTW-based process reward on reasoning trajectories beats plain SFT, with average accuracy rising from 0.598 to 0.689 across six benchmarks. The authors first generate the trajectories with COMCTS on open-source VLMs, using an LLM as judge, then embed the steps with sentence transformers to compute DTW distances for the reward.

They also release the code and generated datasets, which is a practical plus for anyone who wants to check or reuse the trajectories. The two-stage SFT-then-GRPO setup is straightforward and the metrics include accuracy plus BERTScore and ROUGE-L, so the numbers give a clear before-and-after picture. The idea of measuring process similarity via DTW on embedded reasoning steps is a direct extension of existing process-supervision work to the medical VQA case, where step-by-step explanations can matter for reliability. The public release of the datasets and code stands out as the most immediately useful part.

The main limitation is the missing control. The results compare the combined reward against SFT only, with no GRPO run that uses exact-match reward alone on the same checkpoint. Without that, the gains could come from the policy optimization step itself or from extra training time rather than the DTW signal. The trajectories are model-generated and LLM-judged, so any errors there would affect the DTW distances, and the abstract gives no details on how often the judge matches human judgment or on statistical tests for the reported lifts.

This paper is for researchers working on vision-language models in medicine who are already using RL methods and want to try process rewards. A reader focused on medical VQA or trajectory supervision will get value from the released resources and the concrete numbers, even if the controls are incomplete.

It has enough empirical content and open materials to deserve a serious referee, though reviewers will probably ask for the exact-match GRPO baseline and more on trajectory quality. I would send it to peer review.

Referee Report

2 major / 3 minor

Summary. The paper claims that generating reasoning trajectories for six medical VQA benchmarks via COMCTS (with LLM verification) enables a two-stage pipeline of SFT followed by GRPO using a combined reward (exact-match on final answers plus DTW distance on sentence-transformer embeddings of reasoning steps). This yields consistent gains over SFT-only baselines: mean accuracy rises from 0.598 to 0.689, BERTScore from 0.845 to 0.881, and ROUGE-L from 0.665 to 0.748. The authors release the generated datasets and code.

Significance. If the central result holds after addressing controls, the work provides evidence that trajectory-aware process rewards can enhance reasoning quality in medical VLMs beyond outcome-only supervision, with the public datasets offering a reusable resource for the community.

major comments (2)

[Experiments] Experiments section: The reported comparisons are limited to SFT-only versus SFT+GRPO with the combined DTW+exact-match reward. No GRPO run using exact-match reward alone (on the identical SFT checkpoint) is presented, so the observed gains cannot be attributed specifically to the DTW trajectory signal rather than to GRPO optimization, training duration, or reward scaling.
[§3] §3 (Trajectory Generation): The ground-truth trajectories are produced by the same class of open-source VLMs used in training and verified only by an LLM judge, with no reported human validation, inter-annotator agreement, or quality metrics. This leaves open whether the DTW signal reflects genuine reasoning quality or artifacts of the generation process.
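
Why the exact-match-only control matters can be illustrated with the group-normalized advantage that GRPO uses, in which each sampled response's reward is standardized against the other responses drawn for the same prompt. The sketch below is an illustration under assumptions, not the paper's training code: the reward values are made-up toy numbers, `eps` is a hypothetical stabilizer, the population standard deviation is one possible choice, and the process term is assumed to have already been mapped to a bounded similarity score.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled responses, all with the correct final answer.
exact_match_only = [1.0, 1.0, 1.0, 1.0]
# Hypothetical process-similarity terms added to the same four responses.
with_process_term = [1.0 + s for s in (0.9, 0.6, 0.3, 0.2)]

print(group_relative_advantages(exact_match_only))   # all zeros: no learning signal
print(group_relative_advantages(with_process_term))  # ordered by trajectory match
```

When every response in a group earns the same outcome reward, the standardized advantages vanish, whereas any added graded term restores a ranking. An outcome-only run and a combined-reward run can therefore differ in how much gradient signal they receive, which is one reason the control has to be run on the identical checkpoint rather than inferred.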

minor comments (3)

[Abstract] Abstract and §4: Mean improvements are stated without the number of random seeds, standard deviations, or statistical significance tests, making it difficult to assess robustness across the six benchmarks.
[Method] Method: The relative weighting between the DTW and exact-match terms is treated as a free hyperparameter, yet no ablation or sensitivity analysis on this weighting is provided.
[Related Work] Related Work: The positioning relative to prior process-supervision methods (e.g., in math or general VQA) could be expanded with more direct citations to recent GRPO or trajectory-reward papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point-by-point below and outline the revisions we will make to strengthen the manuscript.

Referee: [Experiments] Experiments section: The reported comparisons are limited to SFT-only versus SFT+GRPO with the combined DTW+exact-match reward. No GRPO run using exact-match reward alone (on the identical SFT checkpoint) is presented, so the observed gains cannot be attributed specifically to the DTW trajectory signal rather than to GRPO optimization, training duration, or reward scaling.

Authors: We agree that this ablation is necessary to isolate the contribution of the DTW-based process reward. In the revised manuscript, we will add results from an additional GRPO training run that uses only the exact-match reward on the identical SFT checkpoint. This will enable a direct comparison and clarify whether the observed gains stem from the trajectory signal or from other factors such as the GRPO optimization itself.

revision: yes

Referee: [§3] §3 (Trajectory Generation): The ground-truth trajectories are produced by the same class of open-source VLMs used in training and verified only by an LLM judge, with no reported human validation, inter-annotator agreement, or quality metrics. This leaves open whether the DTW signal reflects genuine reasoning quality or artifacts of the generation process.

Authors: We acknowledge this limitation. Section 3 describes the use of COMCTS with open-source VLMs followed by LLM verification to scale trajectory generation across six benchmarks. We did not perform human validation or report inter-annotator agreement. In the revision, we will expand §3 with a dedicated discussion of the verification methodology, explicitly note the lack of human evaluation as a limitation, and report additional automatic quality metrics on the released trajectories (such as length statistics, embedding-based consistency, and diversity). The public release of the datasets will also allow the community to conduct further validation.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical reward definition and results do not reduce to inputs by construction

The paper's chain consists of (1) independent generation of reasoning trajectories via COMCTS + LLM judge on external benchmarks, (2) definition of a DTW process reward using a separate sentence-transformer embedding model, and (3) empirical comparison of combined reward vs. SFT-only. None of these steps is self-definitional, a fitted input renamed as prediction, or dependent on a load-bearing self-citation. The DTW distance is computed from externally generated references and an off-the-shelf embedder; the reported accuracy/BERTScore/ROUGE gains are experimental outcomes, not quantities forced by the training objective itself. Missing ablations are a methodological limitation but do not create circularity under the specified criteria.

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 0 invented entities

The central claim rests on the assumption that generated trajectories are high-quality and that DTW similarity is a valid proxy for reasoning improvement; no new physical entities are postulated.

free parameters (2)

reward weighting between DTW and exact-match
The abstract describes combining the two rewards but does not specify how they are balanced, implying a tunable hyperparameter.
GRPO and SFT training hyperparameters
Standard RL and fine-tuning hyperparameters are required but not detailed in the abstract.

axioms (3)

domain assumption: COMCTS with open-source VLMs can produce useful reasoning trajectories for medical VQA questions
The data generation step depends on this capability.
domain assumption: An LLM can reliably judge the quality of generated reasoning trajectories
Used as verification judge in the COMCTS pipeline.
domain assumption: Sentence transformer embeddings plus DTW distance capture semantically meaningful similarity between reasoning processes
This underpins the novel process reward.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
cs.CV · 2026-05 · unverdicted · novelty 5.0

LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
