Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Ali Etemad; Elham Dolatabadi; Leonid Sigal; Pritam Sarkar; Shayan Mohammadizadehsamakosh

arxiv: 2606.12590 · v1 · pith:XB3CKMD7new · submitted 2026-06-10 · 💻 cs.CV · cs.AI

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Shayan Mohammadizadehsamakosh , Pritam Sarkar , Leonid Sigal , Ali Etemad , Elham Dolatabadi This is my paper

Pith reviewed 2026-06-27 09:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords medical LVLMspreference optimizationvisual groundingtoken-wise KL regularizationfine-grained alignmentDPO variantsclinical text generationlesion-corrupted images

0 comments

The pith

A bidirectional token-wise KL regularizer plus visual-contrastive grounding from clean and lesion images improves fine-grained on-policy alignment for medical LVLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard DPO-style alignment fails for medical vision-language models because it applies uniform sequence rewards, relies on off-policy references, and ignores visual evidence. It proposes constructing preference pairs by minimally editing the model's own outputs to fix only clinically wrong spans, then regularizing at the token level with a bidirectional KL term while adding an objective that contrasts responses to clean versus lesion-corrupted images. If the approach works, models become sensitive to diagnostically decisive image features and produce clinically accurate text without stylistic drift from external references.

Core claim

Existing preference optimization methods treat all tokens equally, introduce distribution shift from static references, and lack explicit visual grounding, so the authors combine a bidirectional token-wise KL regularizer with a visual-contrastive objective that penalizes outputs generated from corrupted images; this yields a fine-grained on-policy framework whose preference pairs are created by editing only erroneous clinical spans in model-generated text while preserving linguistic style.

What carries the argument

Bidirectional token-wise KL regularizer paired with a visual-contrastive grounding objective that contrasts clean and lesion-corrupted images to enforce visual evidence in responses.

If this is right

Token-level rather than sequence-level signals allow correction of only clinically erroneous spans.
On-policy minimal edits avoid steering optimization toward stylistic artifacts from supervised references.
Explicit pairing of clean and corrupted images forces sensitivity to subtle pathological features.
The resulting models show improved performance on both medical imaging tasks and clinical text generation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same minimal-edit construction of preference data could lower the cost of creating alignment datasets in any domain where expert editing is expensive.
The visual-contrastive term may transfer to other vision-language settings that require grounding in specific image regions rather than global captions.
Because the regularizer operates at token granularity, it could be combined with existing safety or factuality methods that also act on individual tokens.

Load-bearing premise

Minimally editing model-generated outputs will produce clinically valid preference pairs without introducing new biases or distribution shifts.

What would settle it

A controlled test on a medical imaging benchmark in which the trained model still generates responses that ignore visible lesions at the same rate as the unaligned baseline.

Figures

Figures reproduced from arXiv: 2606.12590 by Ali Etemad, Elham Dolatabadi, Leonid Sigal, Pritam Sarkar, Shayan Mohammadizadehsamakosh.

**Figure 2.** Figure 2: Stylistic shift VS. Medical accuracy. Using GT [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of Pass@1 vs. Pass@4 across different regularization strategies. The Bidirectional KL (FKL+RKL) achieves the highest clinical precision while maintaining the distributional richness required for complex reasoning. RKL acts as an on-policy regularizer. By maximizing the log-probability of the reference under the current policy’s samples, it penalizes the model for straying into regions of the… view at source ↗

**Figure 4.** Figure 4: An overview of the fine-grained on-policy visual [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: FIRE-MPO shows superior visual grounding compared to base models and DPO alignment [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Preference-pair construction used in the textual-style ablation. Vanilla preference learning [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt template used for VQA task answer localization and Ground Truth incorporation. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt template used for Report Generation task answer localization and Ground Truth [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

read the original abstract

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adapts DPO for medical LVLMs via bidirectional token-wise KL and a visual-contrastive term on lesion-corrupted images but supplies no results or implementation details to show it works.

read the letter

The one or two things to know are that the work targets specific shortcomings of DPO in medical vision-language models through token-level regularization and a contrastive visual objective, yet the abstract contains no experimental outcomes to evaluate success.

The paper identifies three limitations in current alignment methods for this domain: uniform treatment of tokens in rewards, off-policy shifts from SFT references, and insufficient visual grounding. It introduces a bidirectional token-wise KL regularizer to address token importance and pairs clean images with lesion-corrupted versions in a contrastive loss to enforce reliance on visual evidence. Preference pairs are created on-policy via minimal edits to the model's own outputs, fixing only clinically wrong spans.

This is a reasonable extension of existing preference optimization techniques to the medical context, where accuracy on subtle features is critical. The focus on preserving linguistic style while correcting errors is a sensible design choice.

The main concerns are around the unspecified parts of the method. How the erroneous spans are identified and corrected is not explained, which raises questions about potential new biases or whether the process remains truly on-policy. The details of the image corruption operator are also absent. These elements are central to the proposed framework.

Although the abstract mentions extensive experiments on medical imaging tasks and clinical benchmarks, it provides no numbers, baselines, or statistical details. This makes it hard to determine if the approach delivers measurable improvements.

Readers working on multimodal models for healthcare or on fine-grained alignment techniques would be the ones to get value here. The paper engages honestly with the challenges in the area.

It deserves peer review because the targeted problems are genuine and the ideas build logically on prior work, even with the current gaps in presentation.

Recommendation: Send to referees rather than desk reject, with the expectation that they will probe the implementation details and request the missing results.

Referee Report

3 major / 1 minor

Summary. The paper claims that existing post-training alignment approaches for medical LVLMs, including DPO and variants, suffer from three limitations: sequence-level rewards that treat all tokens equally, off-policy distribution shifts from static SFT references, and lack of explicit visual grounding. It proposes a fine-grained on-policy framework using a bidirectional token-wise KL regularizer combined with a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses lacking visual evidence. Preference pairs are constructed by minimally editing model-generated outputs to correct only clinically erroneous spans while preserving linguistic style. The authors assert that extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the approach.

Significance. If the proposed regularizer and contrastive objective deliver the claimed fine-grained, on-policy alignment with reliable visual grounding and without introducing new biases, the work could meaningfully advance the clinical reliability of LVLMs by improving sensitivity to pathological features and reducing factual inconsistencies in safety-critical medical applications.

major comments (3)

[Abstract] Abstract: the claim that 'extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach' supplies no quantitative results, baselines, error bars, dataset details, or performance metrics, so the central empirical validation assertion is unevidenced.
[Abstract] Abstract: the visual-contrastive grounding objective is described only at a high level ('pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence'); no corruption operator, loss formulation for the bidirectional token-wise KL regularizer, or computation of the contrastive signal is provided, which are load-bearing for the proposed framework.
[Abstract] Abstract: the on-policy preference-pair construction ('minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style') does not specify the span-identification procedure (human, rule-based, or auxiliary model), leaving open risks of annotator bias, distribution shift, or circularity that directly affect the on-policy claim.

minor comments (1)

[Abstract] Abstract: the three listed limitations are stated without citations to specific prior works that exemplify each issue.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback focused on the abstract. We agree that the abstract can be strengthened with additional concrete details without exceeding typical length constraints, and we will revise it accordingly while ensuring the main text already contains the supporting technical descriptions.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach' supplies no quantitative results, baselines, error bars, dataset details, or performance metrics, so the central empirical validation assertion is unevidenced.

Authors: We agree the abstract claim would be more compelling with quantitative support. In the revision we will add concise references to key metrics (e.g., accuracy and grounding improvements), primary datasets, and main baselines from the experimental section. revision: yes
Referee: [Abstract] Abstract: the visual-contrastive grounding objective is described only at a high level ('pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence'); no corruption operator, loss formulation for the bidirectional token-wise KL regularizer, or computation of the contrastive signal is provided, which are load-bearing for the proposed framework.

Authors: The abstract supplies a high-level summary per convention; the corruption operator, bidirectional token-wise KL formulation, and contrastive signal are defined in Section 3. We will insert one additional sentence in the abstract that briefly names these elements to address the concern. revision: yes
Referee: [Abstract] Abstract: the on-policy preference-pair construction ('minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style') does not specify the span-identification procedure (human, rule-based, or auxiliary model), leaving open risks of annotator bias, distribution shift, or circularity that directly affect the on-policy claim.

Authors: We acknowledge that the abstract omits the identification procedure. Section 4 describes the hybrid rule-based plus auxiliary-model approach used to locate erroneous spans. We will add a short clause in the abstract to indicate this procedure and thereby strengthen the on-policy claim. revision: yes

Circularity Check

0 steps flagged

No circularity; no equations or self-referential reductions present

full rationale

The provided abstract and description contain no equations, derivations, or explicit mathematical steps. The method is described at a high level as combining a bidirectional token-wise KL regularizer with a visual-contrastive objective and minimal editing of outputs, but no component is shown reducing to a fitted input by construction, a self-citation chain, or a renamed known result. No load-bearing claim is justified solely by self-citation or ansatz smuggling. The framework is presented as an empirical improvement without a derivation chain that collapses to its inputs, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; ledger cannot be populated.

pith-pipeline@v0.9.1-grok · 5743 in / 1095 out tokens · 26997 ms · 2026-06-27T09:50:32.572857+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 1 canonical work pages

[1]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[2]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[3]

Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Pith/arXiv arXiv 2025
[4]

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[5]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Zirui Zhang, Yan Luo, Jianwei Yang, Haotian Liu, Kai-Wei Cheng, Yu Yue, Jianfeng Peng, Jianfeng Gao, et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[6]

Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, et al. Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

arXiv 2024
[7]

Towards injecting medical visual knowledge into multimodal llms at scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. Towards injecting medical visual knowledge into multimodal llms at scale. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7346–7370, 2024

2024
[8]

Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025
[9]

Multimodal large language models in health care: ap- plications, challenges, and future outlook.Journal of medical Internet research, 26:e59505, 2024

Rawan AlSaad, Alaa Abd-Alrazaq, Sabri Boughorbel, Arfan Ahmed, Max-Antoine Renault, Rafat Damseh, and Javaid Sheikh. Multimodal large language models in health care: ap- plications, challenges, and future outlook.Journal of medical Internet research, 26:e59505, 2024

2024
[10]

Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

Pith/arXiv arXiv 2024
[11]

et al. Zhang. Detecting and evaluating medical hallucinations in large vision language models. arXiv preprint arXiv:2406.10185, 2024

Pith/arXiv arXiv 2024
[12]

Hallucinogen: Benchmarking hallucination in implicit reasoning within large vision language models

Ashish Seth, Dinesh Manocha, and Chirag Agarwal. Hallucinogen: Benchmarking hallucination in implicit reasoning within large vision language models. InProceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 89–102, 2025

2025
[13]

Ai reliability gap: Why large language models fail in safety-critical systems

Praneeth Vadlapati. Ai reliability gap: Why large language models fail in safety-critical systems. ResearchGate / Independent Report, 2026. URL https://www.researchgate.net/ publication/401422885

arXiv 2026
[14]

Cares: A comprehensive benchmark of trustworthiness in medical vision language models.Advances in Neural Information Processing Systems, 37: 140334–140365, 2024

Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, et al. Cares: A comprehensive benchmark of trustworthiness in medical vision language models.Advances in Neural Information Processing Systems, 37: 140334–140365, 2024

2024
[15]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023. 11

2023
[16]

Kto: Model alignment as prospect theoretic optimization, 2024.URL https://arxiv

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024.URL https://arxiv. org/abs/2402.01306, 14, 2023

Pith/arXiv arXiv 2024
[17]

A general theoretical paradigm to understand learning from human preferences

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. InInternational Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

2024
[18]

CheXalign: Preference fine-tuning in chest X-ray interpretation models without hu- man feedback

Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, and Akshay S Chaudhari. CheXalign: Preference fine-tuning in chest X-ray interpretation models without hu- man feedback. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pi...

work page doi:10.18653/v1/2025.acl-long.1342 2025
[19]

Zhu et al

K. Zhu et al. MMedPO: Aligning medical vision-language models with clinical-aware multi- modal preference optimization.arXiv preprint arXiv:2412.06141, 2024

arXiv 2024
[20]

Thomas Savage, Stephen P Ma, Abdessalem Boukil, Ekanath Rangan, Vishwesh Patel, Ivan Lopez, and Jonathan Chen. Fine-tuning methods for large language models in clinical medicine by supervised fine-tuning and direct preference optimization: Comparative evaluation.Journal of Medical Internet Research, 27:e76048, 2025

2025
[21]

Self-training large language and vision assistant for medical question answering

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, and Zhiqiang Tao. Self-training large language and vision assistant for medical question answering. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20052–20060, 2024

2024
[22]

Benchmarking direct preference optimization for medical large vision-language models

Dain Kim, Jiwoo Lee, Jaehoon Yun, Yong Hoe Koo, Qingyu Chen, Hyunjae Kim, and Jaewoo Kang. Benchmarking direct preference optimization for medical large vision-language models. arXiv preprint arXiv:2601.17918, 2026

arXiv 2026
[23]

Same verdict, different reasons: Llm-as-a-judge and clinician disagreement on medical chatbot completeness.arXiv preprint arXiv:2604.16383, 2026

Alexandra DeLucia, Heyuan Huang, Sonal Joshi, Mahsa Yarmohammadi, Ahmed Hassoon, and Mark Dredze. Same verdict, different reasons: Llm-as-a-judge and clinician disagreement on medical chatbot completeness.arXiv preprint arXiv:2604.16383, 2026

Pith/arXiv arXiv 2026
[24]

Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing.JMIR medical informatics, 8(7):e18910, 2020

Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, and Gorka Epelde. Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing.JMIR medical informatics, 8(7):e18910, 2020

2020
[25]

Rrg-dpo: Direct preference optimization for clinically accurate radiology report generation

Hong Liu, Dong Wei, Zhe Xu, Xian Wu, Yefeng Zheng, and Liansheng Wang. Rrg-dpo: Direct preference optimization for clinically accurate radiology report generation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 552–562. Springer, 2025

2025
[26]

Mitigating hallucina- tions in large vision-language models via dpo: On-policy data hold the key

Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, and Dongsheng Li. Mitigating hallucina- tions in large vision-language models via dpo: On-policy data hold the key. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10610–10620, 2025

2025
[27]

Self-alignment of large video language models with refined regularized preference optimization.arXiv preprint arXiv:2504.12083, 2025

Pritam Sarkar and Ali Etemad. Self-alignment of large video language models with refined regularized preference optimization.arXiv preprint arXiv:2504.12083, 2025

arXiv 2025
[28]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

arXiv 2024
[29]

3d-properties: Identifying challenges in dpo and charting a path forward.arXiv preprint arXiv:2406.07327, 2024

Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, and Dong Yan. 3d-properties: Identifying challenges in dpo and charting a path forward.arXiv preprint arXiv:2406.07327, 2024. 12

arXiv 2024
[30]

Rethinking dpo: The role of rejected responses in preference misalignment.arXiv preprint arXiv:2506.12725, 2025

Jay Hyeon Cho, JunHyeok Oh, Myunsoo Kim, and Byung-Jun Lee. Rethinking dpo: The role of rejected responses in preference misalignment.arXiv preprint arXiv:2506.12725, 2025

arXiv 2025
[31]

Mask-dpo: Generalizable fine-grained factuality alignment of llms.arXiv preprint arXiv:2503.02846, 2025

Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-dpo: Generalizable fine-grained factuality alignment of llms.arXiv preprint arXiv:2503.02846, 2025

arXiv 2025
[32]

Token- level direct preference optimization.arXiv preprint arXiv:2404.11999, 2024

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token- level direct preference optimization.arXiv preprint arXiv:2404.11999, 2024

arXiv 2024
[33]

Tgdpo: Harnessing token-level reward guidance for enhancing direct preference optimization.arXiv preprint arXiv:2506.14574, 2025

Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Tgdpo: Harnessing token-level reward guidance for enhancing direct preference optimization.arXiv preprint arXiv:2506.14574, 2025

arXiv 2025
[34]

Puli et al

A. Puli et al. Rad-DPO: Reducing hallucinations in medical visual question answering.arXiv preprint arXiv:2406.06496, 2024

arXiv 2024
[35]

mdpo: Conditional preference optimization for multimodal large language models

Fei Wang, Wenxuan Zhou, James Y Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mdpo: Conditional preference optimization for multimodal large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, 2024

2024
[36]

Fine-grained verifiers: Preference modeling as next-token prediction in vision-language alignment.arXiv preprint arXiv:2410.14148, 2024

Chenhang Cui, An Zhang, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, and Tat-Seng Chua. Fine-grained verifiers: Preference modeling as next-token prediction in vision-language alignment.arXiv preprint arXiv:2410.14148, 2024

arXiv 2024
[37]

A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018

2018
[38]

Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

2021
[39]

Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Ro- driguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, 2016

2016
[40]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022
[41]

Medsam3: Delving into segment anything with medical concepts.arXiv preprint arXiv:2511.19046, 2025

Anglin Liu, Rundong Xue, Xu R Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, and Jintai Chen. Medsam3: Delving into segment anything with medical concepts.arXiv preprint arXiv:2511.19046, 2025

arXiv 2025
[42]

Green: Generative radiology report evaluation and error notation

Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Blueth- gen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, et al. Green: Generative radiology report evaluation and error notation. InFindings of the association for computational linguistics: EMNLP 2024, pages 374–390, 2024

2024
[43]

Liu et al

Y . Liu et al. Analyzing visual grounding in medical LVLMs.arXiv preprint arXiv:2603.14323, 2026

arXiv 2026
[44]

Mllms know where to look: Training-free perception of small visual details with multimodal llms

Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski, et al. Mllms know where to look: Training-free perception of small visual details with multimodal llms. InInternational Confer- ence on Learning Representations, volume 2025, pages 68194–68213, 2025

2025
[45]

Wang et al

H. Wang et al. ASPO: Adaptive sentence-level preference optimization for multimodal large language models. InFindings of the Association for Computational Linguistics (ACL), 2025

2025
[46]

Yang et al

X. Yang et al. OPA-DPO: On-policy alignment for preference optimization.arXiv preprint arXiv:2501.09695, 2025. 13

arXiv 2025
[47]

Fu et al

J. Fu et al. CHiP: Cross-modal hierarchical DPO.arXiv preprint arXiv:2501.16629, 2025

arXiv 2025
[48]

Shukla et al

S. Shukla et al. SymPO: Symmetric preference optimization for vision-language models.arXiv preprint arXiv:2506.11712, 2025

arXiv 2025
[49]

Hein et al

D. Hein et al. CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback. InProceedings of the Association for Computational Linguistics (ACL), 2025

2025
[50]

Liu et al

Y . Liu et al. R-DPO: Recursive direct preference optimization for medical vision-language models. InMedical Image Computing and Computer Assisted Intervention (MICCAI), 2025. 14 A Impact Statement The broader impact of this work lies in its potential to enhance the reliability and factual consistency of AI-driven medical diagnostics. Our framework advanc...

2025
[51]

Preserve the model’s language so the final answer is visible in context

**answer_sentence**: The minimal phrase or clause from the model’s answer that states the final answer, using the model’s exact wording. Preserve the model’s language so the final answer is visible in context. Do not include extra clauses that follow
[52]

Use as few words as possible (the first occurrence of the conclusive answer)

**final_answer**: Locate and extract the final answer to the question from the model’s answer. Use as few words as possible (the first occurrence of the conclusive answer)
[53]

**is_correct**: true if the final answer is correct given the ground truth, false otherwise
[54]

answer_sentence

**opposite_answer**: Modify the answer sentence as little as possible to make a medically plausible alternative for the question. Respond only with valid JSON in this exact shape: {"answer_sentence": "...", "final_answer": "...", "is_correct": true or false, "opposite_answer": "..."} {Stage 2: Prompt used in VQA answer difference localization.} You are gi...

[1] [1]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[2] [2]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[3] [3]

Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

Pith/arXiv arXiv 2025

[4] [4]

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024

[5] [5]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Zirui Zhang, Yan Luo, Jianwei Yang, Haotian Liu, Kai-Wei Cheng, Yu Yue, Jianfeng Peng, Jianfeng Gao, et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[6] [6]

Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, et al. Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024

arXiv 2024

[7] [7]

Towards injecting medical visual knowledge into multimodal llms at scale

Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, et al. Towards injecting medical visual knowledge into multimodal llms at scale. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 7346–7370, 2024

2024

[8] [8]

Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025

[9] [9]

Multimodal large language models in health care: ap- plications, challenges, and future outlook.Journal of medical Internet research, 26:e59505, 2024

Rawan AlSaad, Alaa Abd-Alrazaq, Sabri Boughorbel, Arfan Ahmed, Max-Antoine Renault, Rafat Damseh, and Javaid Sheikh. Multimodal large language models in health care: ap- plications, challenges, and future outlook.Journal of medical Internet research, 26:e59505, 2024

2024

[10] [10]

Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine.arXiv preprint arXiv:2404.18416, 2024

Pith/arXiv arXiv 2024

[11] [11]

et al. Zhang. Detecting and evaluating medical hallucinations in large vision language models. arXiv preprint arXiv:2406.10185, 2024

Pith/arXiv arXiv 2024

[12] [12]

Hallucinogen: Benchmarking hallucination in implicit reasoning within large vision language models

Ashish Seth, Dinesh Manocha, and Chirag Agarwal. Hallucinogen: Benchmarking hallucination in implicit reasoning within large vision language models. InProceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025), pages 89–102, 2025

2025

[13] [13]

Ai reliability gap: Why large language models fail in safety-critical systems

Praneeth Vadlapati. Ai reliability gap: Why large language models fail in safety-critical systems. ResearchGate / Independent Report, 2026. URL https://www.researchgate.net/ publication/401422885

arXiv 2026

[14] [14]

Cares: A comprehensive benchmark of trustworthiness in medical vision language models.Advances in Neural Information Processing Systems, 37: 140334–140365, 2024

Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, et al. Cares: A comprehensive benchmark of trustworthiness in medical vision language models.Advances in Neural Information Processing Systems, 37: 140334–140365, 2024

2024

[15] [15]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023. 11

2023

[16] [16]

Kto: Model alignment as prospect theoretic optimization, 2024.URL https://arxiv

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024.URL https://arxiv. org/abs/2402.01306, 14, 2023

Pith/arXiv arXiv 2024

[17] [17]

A general theoretical paradigm to understand learning from human preferences

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. InInternational Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

2024

[18] [18]

CheXalign: Preference fine-tuning in chest X-ray interpretation models without hu- man feedback

Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, and Akshay S Chaudhari. CheXalign: Preference fine-tuning in chest X-ray interpretation models without hu- man feedback. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pi...

work page doi:10.18653/v1/2025.acl-long.1342 2025

[19] [19]

Zhu et al

K. Zhu et al. MMedPO: Aligning medical vision-language models with clinical-aware multi- modal preference optimization.arXiv preprint arXiv:2412.06141, 2024

arXiv 2024

[20] [20]

Thomas Savage, Stephen P Ma, Abdessalem Boukil, Ekanath Rangan, Vishwesh Patel, Ivan Lopez, and Jonathan Chen. Fine-tuning methods for large language models in clinical medicine by supervised fine-tuning and direct preference optimization: Comparative evaluation.Journal of Medical Internet Research, 27:e76048, 2025

2025

[21] [21]

Self-training large language and vision assistant for medical question answering

Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, and Zhiqiang Tao. Self-training large language and vision assistant for medical question answering. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20052–20060, 2024

2024

[22] [22]

Benchmarking direct preference optimization for medical large vision-language models

Dain Kim, Jiwoo Lee, Jaehoon Yun, Yong Hoe Koo, Qingyu Chen, Hyunjae Kim, and Jaewoo Kang. Benchmarking direct preference optimization for medical large vision-language models. arXiv preprint arXiv:2601.17918, 2026

arXiv 2026

[23] [23]

Same verdict, different reasons: Llm-as-a-judge and clinician disagreement on medical chatbot completeness.arXiv preprint arXiv:2604.16383, 2026

Alexandra DeLucia, Heyuan Huang, Sonal Joshi, Mahsa Yarmohammadi, Ahmed Hassoon, and Mark Dredze. Same verdict, different reasons: Llm-as-a-judge and clinician disagreement on medical chatbot completeness.arXiv preprint arXiv:2604.16383, 2026

Pith/arXiv arXiv 2026

[24] [24]

Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing.JMIR medical informatics, 8(7):e18910, 2020

Debbie Rankin, Michaela Black, Raymond Bond, Jonathan Wallace, Maurice Mulvenna, and Gorka Epelde. Reliability of supervised machine learning using synthetic data in health care: Model to preserve privacy for data sharing.JMIR medical informatics, 8(7):e18910, 2020

2020

[25] [25]

Rrg-dpo: Direct preference optimization for clinically accurate radiology report generation

Hong Liu, Dong Wei, Zhe Xu, Xian Wu, Yefeng Zheng, and Liansheng Wang. Rrg-dpo: Direct preference optimization for clinically accurate radiology report generation. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 552–562. Springer, 2025

2025

[26] [26]

Mitigating hallucina- tions in large vision-language models via dpo: On-policy data hold the key

Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, and Dongsheng Li. Mitigating hallucina- tions in large vision-language models via dpo: On-policy data hold the key. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10610–10620, 2025

2025

[27] [27]

Self-alignment of large video language models with refined regularized preference optimization.arXiv preprint arXiv:2504.12083, 2025

Pritam Sarkar and Ali Etemad. Self-alignment of large video language models with refined regularized preference optimization.arXiv preprint arXiv:2504.12083, 2025

arXiv 2025

[28] [28]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

arXiv 2024

[29] [29]

3d-properties: Identifying challenges in dpo and charting a path forward.arXiv preprint arXiv:2406.07327, 2024

Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, and Dong Yan. 3d-properties: Identifying challenges in dpo and charting a path forward.arXiv preprint arXiv:2406.07327, 2024. 12

arXiv 2024

[30] [30]

Rethinking dpo: The role of rejected responses in preference misalignment.arXiv preprint arXiv:2506.12725, 2025

Jay Hyeon Cho, JunHyeok Oh, Myunsoo Kim, and Byung-Jun Lee. Rethinking dpo: The role of rejected responses in preference misalignment.arXiv preprint arXiv:2506.12725, 2025

arXiv 2025

[31] [31]

Mask-dpo: Generalizable fine-grained factuality alignment of llms.arXiv preprint arXiv:2503.02846, 2025

Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. Mask-dpo: Generalizable fine-grained factuality alignment of llms.arXiv preprint arXiv:2503.02846, 2025

arXiv 2025

[32] [32]

Token- level direct preference optimization.arXiv preprint arXiv:2404.11999, 2024

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token- level direct preference optimization.arXiv preprint arXiv:2404.11999, 2024

arXiv 2024

[33] [33]

Tgdpo: Harnessing token-level reward guidance for enhancing direct preference optimization.arXiv preprint arXiv:2506.14574, 2025

Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Tgdpo: Harnessing token-level reward guidance for enhancing direct preference optimization.arXiv preprint arXiv:2506.14574, 2025

arXiv 2025

[34] [34]

Puli et al

A. Puli et al. Rad-DPO: Reducing hallucinations in medical visual question answering.arXiv preprint arXiv:2406.06496, 2024

arXiv 2024

[35] [35]

mdpo: Conditional preference optimization for multimodal large language models

Fei Wang, Wenxuan Zhou, James Y Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mdpo: Conditional preference optimization for multimodal large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, 2024

2024

[36] [36]

Fine-grained verifiers: Preference modeling as next-token prediction in vision-language alignment.arXiv preprint arXiv:2410.14148, 2024

Chenhang Cui, An Zhang, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, and Tat-Seng Chua. Fine-grained verifiers: Preference modeling as next-token prediction in vision-language alignment.arXiv preprint arXiv:2410.14148, 2024

arXiv 2024

[37] [37]

A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018

2018

[38] [38]

Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

2021

[39] [39]

Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan, Laritza Ro- driguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, 2016

2016

[40] [40]

Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022

[41] [41]

Medsam3: Delving into segment anything with medical concepts.arXiv preprint arXiv:2511.19046, 2025

Anglin Liu, Rundong Xue, Xu R Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, and Jintai Chen. Medsam3: Delving into segment anything with medical concepts.arXiv preprint arXiv:2511.19046, 2025

arXiv 2025

[42] [42]

Green: Generative radiology report evaluation and error notation

Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Blueth- gen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, et al. Green: Generative radiology report evaluation and error notation. InFindings of the association for computational linguistics: EMNLP 2024, pages 374–390, 2024

2024

[43] [43]

Liu et al

Y . Liu et al. Analyzing visual grounding in medical LVLMs.arXiv preprint arXiv:2603.14323, 2026

arXiv 2026

[44] [44]

Mllms know where to look: Training-free perception of small visual details with multimodal llms

Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski, et al. Mllms know where to look: Training-free perception of small visual details with multimodal llms. InInternational Confer- ence on Learning Representations, volume 2025, pages 68194–68213, 2025

2025

[45] [45]

Wang et al

H. Wang et al. ASPO: Adaptive sentence-level preference optimization for multimodal large language models. InFindings of the Association for Computational Linguistics (ACL), 2025

2025

[46] [46]

Yang et al

X. Yang et al. OPA-DPO: On-policy alignment for preference optimization.arXiv preprint arXiv:2501.09695, 2025. 13

arXiv 2025

[47] [47]

Fu et al

J. Fu et al. CHiP: Cross-modal hierarchical DPO.arXiv preprint arXiv:2501.16629, 2025

arXiv 2025

[48] [48]

Shukla et al

S. Shukla et al. SymPO: Symmetric preference optimization for vision-language models.arXiv preprint arXiv:2506.11712, 2025

arXiv 2025

[49] [49]

Hein et al

D. Hein et al. CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback. InProceedings of the Association for Computational Linguistics (ACL), 2025

2025

[50] [50]

Liu et al

Y . Liu et al. R-DPO: Recursive direct preference optimization for medical vision-language models. InMedical Image Computing and Computer Assisted Intervention (MICCAI), 2025. 14 A Impact Statement The broader impact of this work lies in its potential to enhance the reliability and factual consistency of AI-driven medical diagnostics. Our framework advanc...

2025

[51] [51]

Preserve the model’s language so the final answer is visible in context

**answer_sentence**: The minimal phrase or clause from the model’s answer that states the final answer, using the model’s exact wording. Preserve the model’s language so the final answer is visible in context. Do not include extra clauses that follow

[52] [52]

Use as few words as possible (the first occurrence of the conclusive answer)

**final_answer**: Locate and extract the final answer to the question from the model’s answer. Use as few words as possible (the first occurrence of the conclusive answer)

[53] [53]

**is_correct**: true if the final answer is correct given the ground truth, false otherwise

[54] [54]

answer_sentence

**opposite_answer**: Modify the answer sentence as little as possible to make a medically plausible alternative for the question. Respond only with valid JSON in this exact shape: {"answer_sentence": "...", "final_answer": "...", "is_correct": true or false, "opposite_answer": "..."} {Stage 2: Prompt used in VQA answer difference localization.} You are gi...