SDR: Set-Distance Rewards for Radiology Report Generation

Halil Ibrahim Gulluk; Max Van Puyvelde; Olivier Gevaert; Wim Van Criekinge

arxiv: 2606.00440 · v1 · pith:MOOCOYEFnew · submitted 2026-05-30 · 💻 cs.AI

SDR: Set-Distance Rewards for Radiology Report Generation

Halil Ibrahim Gulluk , Max Van Puyvelde , Wim Van Criekinge , Olivier Gevaert This is my paper

Pith reviewed 2026-06-28 19:16 UTC · model grok-4.3

classification 💻 cs.AI

keywords radiology report generationset-distance rewardsreinforcement learningvision-language modelschest X-rayGRPOsentence embeddings

0 comments

The pith

Set-to-set distances between sentence embeddings provide continuous rewards that improve chest X-ray report generation over exact-match methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that radiology reports consist of unordered findings rather than sequential reasoning steps, so standard exact-match rewards are mismatched to the task. It splits reports into sentences, embeds them with a frozen transformer, and treats the resulting sets as the objects to be compared. Set-to-set distances then serve as permutation-invariant, continuous reward signals inside GRPO post-training. The same distances also support test-time best-of-N selection and mid-generation pruning. Experiments across two datasets and three vision-language models show consistent gains on BERTScore, RadGraph F1, and CheXbert F1.

Core claim

Post-training with set-to-set distance rewards via GRPO outperforms supervised fine-tuning and exact-match GRPO on BERTScore, RadGraph F1, and CheXbert F1 by average relative improvements of 6.80 percent, 7.82 percent, and 4.45 percent respectively; the identical distances also enable best-of-N selection that improves BERTScore by 16.4 percent on average and allow pruning that cuts generated tokens by more than 50 percent while preserving quality.

What carries the argument

Set-to-set distances computed on unordered collections of sentence embeddings produced by a frozen sentence transformer, used as the reward signal inside GRPO.

If this is right

The reward signal works without retraining the embedding model and transfers across different base vision-language models.
The same distances improve candidate selection at test time for both the trained models and closed-source LLMs.
Mid-generation pruning guided by the distance signal reduces compute while matching the quality of full best-of-N sampling.
Report generation can be viewed as matching unordered sets of findings rather than producing a single correct sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other generation tasks whose outputs are best described as collections of independent facts rather than ordered narratives.
Alternative embedding models or distance functions could be swapped in to test whether the current choice is optimal.
Because the reward is computed from a frozen model, the method decouples reward design from policy optimization and could be reused across related medical reporting tasks.

Load-bearing premise

That distances between sets of sentence embeddings from a frozen transformer serve as a valid and superior proxy for clinical report quality independent of the embedding model or distance function.

What would settle it

A side-by-side human radiologist evaluation or clinical error audit in which reports produced under set-distance rewards show no improvement or worse performance than those from exact-match rewards.

Figures

Figures reproduced from arXiv: 2606.00440 by Halil Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert, Wim Van Criekinge.

**Figure 2.** Figure 2: Set distances and inference-time response selection. (a) [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Method × metric heatmap. Mean percentage improvement over random selection averaged across all 13 models, on five clinically meaningful metrics. Rows are (distance metric, aggregation) pairs grouped by distance-metric family; columns are NLP metrics. Teal cells beat random, coral cells lose to it. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_3.png] view at source ↗

**Figure 4.** Figure 4: Percent improvement over random (all patients) – BERTScore F1 with Avg [PITH_FULL_IMAGE:figures/full_fig_p034_4.png] view at source ↗

**Figure 5.** Figure 5: Percent improvement over random (all patients) – RadGraph F1 with Avg [PITH_FULL_IMAGE:figures/full_fig_p034_5.png] view at source ↗

**Figure 6.** Figure 6: The selection rule picks the candidate with the lowest distance to the training distribution; that candidate also has the highest BERTScore-F1 against the ground truth. The three rejected alternatives shown were among the candidates whose distance to the training distribution was the larger under the chosen (metric, agg) pair, and they correspondingly score lower in BERTScore-F1. 37 [PITH_FULL_IMAGE:figur… view at source ↗

**Figure 7.** Figure 7: The selection rule picks the candidate with the lowest distance to the training distribution; that candidate also has the highest BERTScore-F1 against the ground truth. The three rejected alternatives shown were among the candidates whose distance to the training distribution was the larger under the chosen (metric, agg) pair, and they correspondingly score lower in BERTScore-F1. 38 [PITH_FULL_IMAGE:figur… view at source ↗

**Figure 8.** Figure 8: The selection rule picks the candidate with the lowest distance to the training distribution; that candidate also has the highest BERTScore-F1 against the ground truth. The three rejected alternatives shown were among the candidates whose distance to the training distribution was the larger under the chosen (metric, agg) pair, and they correspondingly score lower in BERTScore-F1. 39 [PITH_FULL_IMAGE:figur… view at source ↗

read the original abstract

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision--language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average \%6.80, \%7.82 and \%4.45 relative improvements respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average \%16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50\% while preserving the Findings quality of full best-of-$N$ selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \href{https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA}{available}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Set-distance rewards give metric gains in report generation but leave the clinical proxy question open.

read the letter

The punchline on this one is that the set-distance reward idea delivers reported gains on standard metrics for radiology report generation, but the clinical validity of the proxy is not strongly established yet.

They take reports, split them into sentences, embed with a frozen transformer, and treat the collection as a set. Distances between generated and reference sets serve as the RL reward for GRPO. This avoids the problem with exact match since findings have no order. They show this beats SFT and exact-match GRPO on BERTScore, RadGraph F1, and CheXbert F1 by 4-8% relative on average, across two datasets and three open models. The distances also help at test time for selecting or pruning candidates, including on closed models, and cut generation tokens substantially.

What the paper does well is identify the mismatch with typical RL rewards and provide a simple fix that works for both training and scaling. The test-time pruning result is a nice practical addition. Releasing code is good.

The soft spots are around validation of the reward. The headline metrics include BERTScore which is embedding based, so some circularity is possible. No ablations on embedding choice or distance function are mentioned, and there's no correlation shown with radiologist ratings or checks for medical nuances like negation. The stress-test concern about the proxy holding up is reasonable given the evidence presented. If the full paper has those checks, it would strengthen things.

This is for the medical VLM and RL community. A reader working on report generation would pick up the test-time ideas. It deserves serious referee time because the core idea is sound and the experiments are relevant, even if more work on the proxy is needed.

I'd recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces set-distance rewards (SDR) for chest X-ray report generation: reports are split into sentences, embedded via a frozen sentence transformer, and set-to-set distances between generated and reference embedding sets serve as continuous, permutation-invariant rewards for GRPO post-training. The central claim is that this approach outperforms supervised fine-tuning and exact-match GRPO on BERTScore, RadGraph F1, and CheXbert F1 (average relative gains 6.80%, 7.82%, 4.45%) across two datasets and three VLMs (Qwen3-VL-2B/4B, Gemma3-4B); the same distances further enable test-time best-of-N selection and mid-generation pruning, with code released publicly.

Significance. If the results hold after addressing validation gaps, the work supplies a unified, practical reward signal suited to the unordered structure of radiology findings, extending RL and test-time scaling techniques to this domain. Public code availability is a clear strength for reproducibility.

major comments (2)

[Abstract] Abstract: the headline claim of consistent outperformance on RadGraph F1 and CheXbert F1 rests on the untested assumption that set-to-set distances on frozen sentence-transformer embeddings correlate with clinical extractors or expert judgment; no correlation analysis, ablation on embedding model, or distance function (e.g., which set metric) is referenced, leaving open whether gains reflect embedding proximity rather than report quality.
[Abstract] Abstract: potential circularity exists because BERTScore is itself embedding-based while the reward optimizes embedding-set proximity; the comparison to exact-match GRPO does not isolate whether reported gains derive from reward density or from clinical fidelity of the proxy.

minor comments (2)

[Abstract] Abstract: statistical significance, confidence intervals, or variance across runs are not mentioned for the reported relative improvements.
[Abstract] Abstract: the description of the set-distance computation and its invariance properties could be expanded for clarity even at the abstract level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point below and indicate where revisions will be made to address the concerns.

read point-by-point responses

Referee: the headline claim of consistent outperformance on RadGraph F1 and CheXbert F1 rests on the untested assumption that set-to-set distances on frozen sentence-transformer embeddings correlate with clinical extractors or expert judgment; no correlation analysis, ablation on embedding model, or distance function (e.g., which set metric) is referenced, leaving open whether gains reflect embedding proximity rather than report quality.

Authors: We agree that an explicit correlation analysis between the set-to-set distances and the clinical metrics would strengthen the presentation. The consistent gains on RadGraph F1 and CheXbert F1 (which rely on entity/relation extraction rather than embeddings) provide empirical support that the rewards improve clinical quality. In revision we will add a correlation analysis, an ablation on the sentence embedding model, and clarification of the specific set metric used in Section 3. revision: yes
Referee: potential circularity exists because BERTScore is itself embedding-based while the reward optimizes embedding-set proximity; the comparison to exact-match GRPO does not isolate whether reported gains derive from reward density or from clinical fidelity of the proxy.

Authors: We disagree that the overall evaluation is circular. While BERTScore is embedding-based, the headline results also report RadGraph F1 and CheXbert F1, which are not. The fact that SDR-GRPO outperforms exact-match GRPO on these independent clinical metrics indicates that the gains arise from the clinical alignment of the set-distance signal rather than reward density alone. We will revise the abstract and add explicit discussion to highlight this distinction. revision: partial

Circularity Check

0 steps flagged

No circularity: rewards defined externally via frozen embeddings, gains are measured outcomes

full rationale

The paper defines set-to-set distances on embeddings from a frozen external sentence transformer as the reward signal for GRPO. These distances are independent of the evaluation metrics (BERTScore, RadGraph F1, CheXbert F1), which are reported as post-training outcomes rather than inputs or fitted targets. No equations reduce the claimed improvements to the reward definition by construction, no parameters are fitted to subsets and relabeled as predictions, and the provided text contains no self-citations or uniqueness theorems from prior author work. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about embedding similarity and RL optimization; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption Sentence embeddings from a frozen pre-trained transformer reflect semantic equivalence of medical findings sufficiently for set-distance comparison
Invoked to justify treating embedding sets as proxies for report quality.

pith-pipeline@v0.9.1-grok · 5882 in / 1225 out tokens · 34565 ms · 2026-06-28T19:16:23.294003+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 11 canonical work pages · 5 internal anchors

[1]

Towards a holistic framework for multimodal llm in 3d brain ct radiology report generation.Nature Communications, 16(1):2258, 2025

Cheng-Yi Li, Kao-Jung Chang, Cheng-Fu Yang, Hsin-Yu Wu, Wenting Chen, Hritik Bansal, Ling Chen, Yi-Ping Yang, Yu-Chun Chen, Shih-Pin Chen, et al. Towards a holistic framework for multimodal llm in 3d brain ct radiology report generation.Nature Communications, 16(1):2258, 2025

2025
[2]

Guangyi Liu, Yinghong Liao, Fuyu Wang, Bin Zhang, Lu Zhang, Xiaodan Liang, Xiang Wan, Shaolin Li, Zhen Li, Shuixing Zhang, et al. Medical-vlbert: Medical visual language bert for covid-19 ct report generation with alternate learning.IEEE transactions on neural networks and learning systems, 32(9):3786–3797, 2021

2021
[3]

Ct2rep: Automated radiology report generation for 3d medical imaging

Ibrahim Ethem Hamamci, Sezgin Er, and Bjoern Menze. Ct2rep: Automated radiology report generation for 3d medical imaging. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–486. Springer, 2024

2024
[4]

Clinically accurate chest x-ray report generation

Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, and Marzyeh Ghassemi. Clinically accurate chest x-ray report generation. InMachine learning for healthcare conference, pages 249–269. PMLR, 2019

2019
[5]

Dynamic graph enhanced contrastive learning for chest x-ray report generation

Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xiaodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3334–3343, 2023

2023
[6]

Retrieval-based chest x-ray report generation using a pre-trained contrastive language- image model

Mark Endo, Rayan Krishnan, Viswesh Krishna, Andrew Y Ng, and Pranav Rajpurkar. Retrieval-based chest x-ray report generation using a pre-trained contrastive language- image model. InMachine learning for health, pages 209–219. PMLR, 2021

2021
[7]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

work page arXiv 2023
[10]

Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025

Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025

work page arXiv 2025
[11]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023
[12]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

arXiv preprint arXiv:2504.16828 , year=

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025

work page arXiv 2025
[14]

The lessons of developing process reward models in mathematical reasoning

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025. 11

2025
[15]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InInternational Conference on Learning Representations, volume 2024, pages 39578–39601, 2024

2024
[16]

Multi-modal understanding and generation for medical images and text via vision-language pre-training.IEEE Journal of Biomedical and Health Informatics, 26(12):6070–6080, 2022

Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, and Edward Choi. Multi-modal understanding and generation for medical images and text via vision-language pre-training.IEEE Journal of Biomedical and Health Informatics, 26(12):6070–6080, 2022

2022
[17]

Med-flamingo: a multimodal medical few-shot learner

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine learning for health (ML4H), pages 353–367. PMLR, 2023

2023
[18]

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models.IEEE Transactions on Medical Imaging, 2026

Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang. Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models.IEEE Transactions on Medical Imaging, 2026

2026
[19]

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning

Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 337–347. Springer, 2025

2025
[20]

R- prm: Reasoning-driven process reward modeling

Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R- prm: Reasoning-driven process reward modeling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13449–13462, 2025

2025
[21]

Entropy-regularized process reward model

Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, and Tong Zhang. Entropy-regularized process reward model. arXiv preprint arXiv:2412.11006, 2024

work page arXiv 2024
[22]

More bang for the buck: Process reward modeling with entropy-driven uncertainty.arXiv preprint arXiv:2503.22233, 2025

Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, et al. More bang for the buck: Process reward modeling with entropy-driven uncertainty.arXiv preprint arXiv:2503.22233, 2025

work page arXiv 2025
[23]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

2019
[24]

MIMIC- CXR Database.PhysioNet, July 2024

Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. MIMIC- CXR Database.PhysioNet, July 2024. Version 2.1.0

2024
[25]

Rexgradient-160k: A large-scale publicly available dataset of chest radiographs with free-text reports.arXiv preprint arXiv:2505.00228, 2025

Xiaoman Zhang, Julián N Acosta, Josh Miller, Ouwen Huang, and Pranav Rajpurkar. Rexgradient-160k: A large-scale publicly available dataset of chest radiographs with free-text reports.arXiv preprint arXiv:2505.00228, 2025

work page arXiv 2025
[26]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024. 12 A Set-to-set distance metrics All metrics defined below operate on two finite, non-empty ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Towards a holistic framework for multimodal llm in 3d brain ct radiology report generation.Nature Communications, 16(1):2258, 2025

Cheng-Yi Li, Kao-Jung Chang, Cheng-Fu Yang, Hsin-Yu Wu, Wenting Chen, Hritik Bansal, Ling Chen, Yi-Ping Yang, Yu-Chun Chen, Shih-Pin Chen, et al. Towards a holistic framework for multimodal llm in 3d brain ct radiology report generation.Nature Communications, 16(1):2258, 2025

2025

[2] [2]

Guangyi Liu, Yinghong Liao, Fuyu Wang, Bin Zhang, Lu Zhang, Xiaodan Liang, Xiang Wan, Shaolin Li, Zhen Li, Shuixing Zhang, et al. Medical-vlbert: Medical visual language bert for covid-19 ct report generation with alternate learning.IEEE transactions on neural networks and learning systems, 32(9):3786–3797, 2021

2021

[3] [3]

Ct2rep: Automated radiology report generation for 3d medical imaging

Ibrahim Ethem Hamamci, Sezgin Er, and Bjoern Menze. Ct2rep: Automated radiology report generation for 3d medical imaging. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–486. Springer, 2024

2024

[4] [4]

Clinically accurate chest x-ray report generation

Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits, and Marzyeh Ghassemi. Clinically accurate chest x-ray report generation. InMachine learning for healthcare conference, pages 249–269. PMLR, 2019

2019

[5] [5]

Dynamic graph enhanced contrastive learning for chest x-ray report generation

Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xiaodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3334–3343, 2023

2023

[6] [6]

Retrieval-based chest x-ray report generation using a pre-trained contrastive language- image model

Mark Endo, Rayan Krishnan, Viswesh Krishna, Andrew Y Ng, and Pranav Rajpurkar. Retrieval-based chest x-ray report generation using a pre-trained contrastive language- image model. InMachine learning for health, pages 209–219. PMLR, 2021

2021

[7] [7]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, et al. Improve mathematical reasoning in language models by automated process supervision.arXiv preprint arXiv:2406.06592, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator.arXiv preprint arXiv:2312.04474, 2023

work page arXiv 2023

[10] [10]

Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025

Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs.arXiv preprint arXiv:2504.00993, 2025

work page arXiv 2025

[11] [11]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023

[12] [12]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

arXiv preprint arXiv:2504.16828 , year=

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, and Lu Wang. Process reward models that think. arXiv preprint arXiv:2504.16828, 2025

work page arXiv 2025

[14] [14]

The lessons of developing process reward models in mathematical reasoning

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025. 11

2025

[15] [15]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InInternational Conference on Learning Representations, volume 2024, pages 39578–39601, 2024

2024

[16] [16]

Multi-modal understanding and generation for medical images and text via vision-language pre-training.IEEE Journal of Biomedical and Health Informatics, 26(12):6070–6080, 2022

Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, and Edward Choi. Multi-modal understanding and generation for medical images and text via vision-language pre-training.IEEE Journal of Biomedical and Health Informatics, 26(12):6070–6080, 2022

2022

[17] [17]

Med-flamingo: a multimodal medical few-shot learner

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine learning for health (ML4H), pages 353–367. PMLR, 2023

2023

[18] [18]

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models.IEEE Transactions on Medical Imaging, 2026

Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Yuheng Li, Konstantinos Psounis, and Xiaofeng Yang. Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models.IEEE Transactions on Medical Imaging, 2026

2026

[19] [19]

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning

Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, and Daniel Rueckert. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 337–347. Springer, 2025

2025

[20] [20]

R- prm: Reasoning-driven process reward modeling

Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R- prm: Reasoning-driven process reward modeling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13449–13462, 2025

2025

[21] [21]

Entropy-regularized process reward model

Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, and Tong Zhang. Entropy-regularized process reward model. arXiv preprint arXiv:2412.11006, 2024

work page arXiv 2024

[22] [22]

More bang for the buck: Process reward modeling with entropy-driven uncertainty.arXiv preprint arXiv:2503.22233, 2025

Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Huacong Xu, Yuxian Wang, Wu Ning, Qian Chen, Mofan Peng, Zijie Chen, et al. More bang for the buck: Process reward modeling with entropy-driven uncertainty.arXiv preprint arXiv:2503.22233, 2025

work page arXiv 2025

[23] [23]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019

2019

[24] [24]

MIMIC- CXR Database.PhysioNet, July 2024

Alistair Johnson, Tom Pollard, Roger Mark, Seth Berkowitz, and Steven Horng. MIMIC- CXR Database.PhysioNet, July 2024. Version 2.1.0

2024

[25] [25]

Rexgradient-160k: A large-scale publicly available dataset of chest radiographs with free-text reports.arXiv preprint arXiv:2505.00228, 2025

Xiaoman Zhang, Julián N Acosta, Josh Miller, Ouwen Huang, and Pranav Rajpurkar. Rexgradient-160k: A large-scale publicly available dataset of chest radiographs with free-text reports.arXiv preprint arXiv:2505.00228, 2025

work page arXiv 2025

[26] [26]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024. 12 A Set-to-set distance metrics All metrics defined below operate on two finite, non-empty ...

work page internal anchor Pith review Pith/arXiv arXiv 2024